* [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3
@ 2011-11-18 16:58 Mel Gorman
  2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
                   ` (13 more replies)
  0 siblings, 14 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

This series is against 3.2-rc2 and follows on from discussions on "mm:
Do not stall in synchronous compaction for THP allocations". That
patch eliminated stalls due to compaction which sometimes resulted
in user-visible interactivity problems on browsers. The downside was
that THP allocation success rates were lower because dirty pages were
not being migrated.

This series is an RFC on how we can migrate more dirty pages with
asynchronous compaction.  The intention is to maximise transparent
hugepage availability while minimising stalls. This does not rule out
the possibility of having a tunable to enable synchronous compaction
at a time when a user is willing to stall waiting on huge pages and
to disable khugepaged.

Patch 1 partially reverts commit 39deaf85 to allow migration to isolate
	dirty pages.

Patch 2 notes that the /proc/sys/vm/compact_memory handler is not using
	synchronous compaction when it should be.

Patch 3 prevents THP allocations using synchronous compaction as this
	can result in user-visible stalls. More details on the stalls
	are in the changelog.

Patch 4 adds a sync parameter to the migratepage callback. It is up
	to the callback to migrate that page without blocking if
	sync==false. For example, fallback_migrate_page will not
	call writepage if sync==false. A rough sketch of the expected
	callback behaviour follows this list.

Patch 5 restores filter-awareness to isolate_lru_page for migration.
	In practice, it means that pages under writeback and pages
	without a ->migratepage callback will not be isolated
	for migration.
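
As a rough sketch of the contract patch 4 introduces (illustrative only,
example_migratepage is a made-up name and the dirty-page handling is
simplified), a non-blocking-aware callback would look something like:

static int example_migratepage(struct address_space *mapping,
			       struct page *newpage, struct page *page,
			       bool sync)
{
	/*
	 * Writing a dirty page back would block, so refuse in async
	 * mode and let compaction move on to another page.
	 */
	if (PageDirty(page) && !sync)
		return -EBUSY;

	/* Clean page (or sync caller): the simple copy does not block */
	return migrate_page(mapping, newpage, page, sync);
}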

This has been lightly tested and nothing horrible fell out
but I need to think a lot more on patch 4 to see if
buffer_migrate_lock_buffers() is really doing the right thing for
async compaction and if the backout logic is correct. Stalls due
to compaction were eliminated and hugepage allocation success rates
were more or less the same. I'm running a more comprehensive set of
tests over the weekend to see if the warning in patch 4 triggers
in particular and what the allocation success rates look like for
different loads.

Andrea, I didn't pick up your "move ISOLATE_CLEAN setting out of
compaction_migratepages loop" but obviously could if this series gains
any traction. This is also orthogonal to your "improve synchronous
compaction" idea but obviously if the stalls from sync compaction could
be significantly reduced, it would still not collide with this series
that improves the migration of dirty pages for asynchronous compaction.
If your approach works, it would replace patch 3 from this series.

 fs/btrfs/disk-io.c      |    2 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    4 +-
 include/linux/fs.h      |    9 +++-
 include/linux/gfp.h     |   11 +++++
 include/linux/migrate.h |    2 +-
 include/linux/mmzone.h  |    2 +
 mm/compaction.c         |    3 +-
 mm/migrate.c            |  106 ++++++++++++++++++++++++++++++++---------------
 mm/page_alloc.c         |    9 ++++-
 mm/vmscan.c             |   36 +++++++++++++++-
 11 files changed, 140 insertions(+), 46 deletions(-)

-- 
1.7.3.4



* [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
@ 2011-11-18 16:58 ` Mel Gorman
  2011-11-18 17:28   ` Andrea Arcangeli
  2011-11-21 17:16   ` Rik van Riel
  2011-11-18 16:58 ` [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
                   ` (12 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that it was meaningless to pick the page and re-add it to the LRU list.

What was missed during review is that asynchronous migration moves
dirty pages if their ->migratepage callback is migrate_page() because
these can be moved without blocking. This potentially impacted
hugepage allocation success rates by a factor depending on how many
dirty pages are in the system.

This patch partially reverts 39deaf85 to allow migration to isolate
dirty pages again. This increases how much compaction disrupts the
LRU but that is addressed later in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..237560e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -349,9 +349,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
-		if (!cc->sync)
-			mode |= ISOLATE_CLEAN;
-
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
-- 
1.7.3.4



* [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
  2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
@ 2011-11-18 16:58 ` Mel Gorman
  2011-11-18 17:27   ` Andrea Arcangeli
  2011-11-21 21:46   ` Rik van Riel
  2011-11-18 16:58 ` [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
                   ` (11 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

When asynchronous compaction was introduced, the
/proc/sys/vm/compact_memory handler should have been updated to always
use synchronous compaction. This did not happen so this patch addresses
it. The assumption is if a user writes to /proc/sys/vm/compact_memory,
they are willing for that process to stall.
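
For reference, the handler is triggered by writing any value to the file,
e.g. "echo 1 > /proc/sys/vm/compact_memory", which compacts all zones on
all online nodes with order == -1, so the writing process is the natural
one to absorb the stall.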

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 mm/compaction.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 237560e..615502b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -666,6 +666,7 @@ static int compact_node(int nid)
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
 			.order = -1,
+			.sync = true,
 		};
 
 		zone = &pgdat->node_zones[zoneid];
-- 
1.7.3.4



* [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
  2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
  2011-11-18 16:58 ` [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
@ 2011-11-18 16:58 ` Mel Gorman
  2011-11-18 17:34   ` Andrea Arcangeli
  2011-11-18 16:58 ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

Occasionally during large file copies to slow storage, there are still
reports of user-visible stalls when THP is enabled. Reports on this
have been intermittent and not reliably reproducible locally but:

Andy Isaacson reported a problem copying to VFAT on SD Card
	https://lkml.org/lkml/2011/11/7/2

	In this case, it was stuck in munmap for between 20 and 60
	seconds in compaction. It is also possible that khugepaged
	was holding mmap_sem on this process if CONFIG_NUMA was set.

Johannes Weiner reported stalls on USB
	https://lkml.org/lkml/2011/7/25/378

	In this case, there is no stack trace but it looks like the
	same problem. The USB stick may have been using NTFS as a
	filesystem based on other work done related to writing back
	to USB around the same time.

Internally in SUSE, I received a bug report related to stalls in firefox
	when using Java and Flash heavily while copying from NFS
	to VFAT on USB. It has not been confirmed to be the same problem
	but if it looks like a duck and quacks like a duck.....

In the past, commit [11bc82d6: mm: compaction: Use async migration for
__GFP_NO_KSWAPD and enforce no writeback] ensured that sync compaction
would never be used for THP allocations. This was reverted in commit
[c6a140bf: mm/compaction: reverse the change that forbade sync
migraton with __GFP_NO_KSWAPD] on the grounds that it was uncertain
it was beneficial.

While user-visible stalls do not happen for me when writing to USB,
I set up a test running postmark while short-lived processes created
anonymous mappings. The objective was to exercise the paths that
allocate transparent huge pages. I then logged when processes were
stalled for more than 1 second, recorded a stack trace and did some
analysis to aggregate unique "stall events" which revealed

Time stalled in this event:    47369 ms
Event count:                      20
usemem               sleep_on_page          3690 ms
usemem               sleep_on_page          2148 ms
usemem               sleep_on_page          1534 ms
usemem               sleep_on_page          1518 ms
usemem               sleep_on_page          1225 ms
usemem               sleep_on_page          2205 ms
usemem               sleep_on_page          2399 ms
usemem               sleep_on_page          2398 ms
usemem               sleep_on_page          3760 ms
usemem               sleep_on_page          1861 ms
usemem               sleep_on_page          2948 ms
usemem               sleep_on_page          1515 ms
usemem               sleep_on_page          1386 ms
usemem               sleep_on_page          1882 ms
usemem               sleep_on_page          1850 ms
usemem               sleep_on_page          3715 ms
usemem               sleep_on_page          3716 ms
usemem               sleep_on_page          4846 ms
usemem               sleep_on_page          1306 ms
usemem               sleep_on_page          1467 ms
[<ffffffff810ef30c>] wait_on_page_bit+0x6c/0x80
[<ffffffff8113de9f>] unmap_and_move+0x1bf/0x360
[<ffffffff8113e0e2>] migrate_pages+0xa2/0x1b0
[<ffffffff81134273>] compact_zone+0x1f3/0x2f0
[<ffffffff811345d8>] compact_zone_order+0xa8/0xf0
[<ffffffff811346ff>] try_to_compact_pages+0xdf/0x110
[<ffffffff810f773a>] __alloc_pages_direct_compact+0xda/0x1a0
[<ffffffff810f7d5d>] __alloc_pages_slowpath+0x55d/0x7a0
[<ffffffff810f8151>] __alloc_pages_nodemask+0x1b1/0x1c0
[<ffffffff811331db>] alloc_pages_vma+0x9b/0x160
[<ffffffff81142bb0>] do_huge_pmd_anonymous_page+0x160/0x270
[<ffffffff814410a7>] do_page_fault+0x207/0x4c0
[<ffffffff8143dde5>] page_fault+0x25/0x30

The stall times are approximate at best but the estimates represent 25%
of the worst stalls and even if the estimates are off by a factor of
10, it's severe.

This patch once again prevents sync migration for transparent
hugepage allocations as it is preferable to fail a THP allocation
than stall.

It was suggested that __GFP_NORETRY be used instead of __GFP_NO_KSWAPD
to look less like a special case. This would prevent THP allocation
using sync compaction but it would have other side-effects. There are
existing users of __GFP_NORETRY that are doing high-order allocations
and while they can handle allocation failure, it seems reasonable that
they continue to use sync compaction unless there is a deliberate
reason to change that. To help clarify this for the future, this
patch updates the comment for __GFP_NO_KSWAPD.

If accepted, this is a -stable candidate.

Reported-by: Andy Isaacson <adi@hexapodia.org>
Reported-and-tested-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/gfp.h |   11 +++++++++++
 mm/page_alloc.c     |    9 ++++++++-
 2 files changed, 19 insertions(+), 1 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 3a76faf..ef1b1af 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -83,7 +83,18 @@ struct vm_area_struct;
 #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
 #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
 
+/*
+ * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
+ * over excessive disruption of the system. Currently this means
+ * 1. Do not wake kswapd (hence the flag name)
+ * 2. Do not use synchronous compaction for high-order allocations
+ *    as this may cause the caller to stall writing out pages
+ *
+ * This flag is primarily intended for use with transparent hugepage support.
+ * If the flag is used outside the VM, linux-mm should be cc'd for review.
+ */
 #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
+
 #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
 
 /*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..7a5c5b1 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2168,7 +2168,14 @@ rebalance:
 					sync_migration);
 	if (page)
 		goto got_pg;
-	sync_migration = true;
+
+	/*
+	 * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
+	 * the system should not be heavily disrupted. In practice, this is
+	 * to avoid THP callers being stalled in writeback during migration
+	 * as it's preferable for the allocations to fail than to stall
+	 */
+	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
-- 
1.7.3.4



* [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (2 preceding siblings ...)
  2011-11-18 16:58 ` [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
@ 2011-11-18 16:58 ` Mel Gorman
  2011-11-18 21:35   ` Andrea Arcangeli
  2011-11-19  8:59   ` Nai Xia
  2011-11-18 16:58 ` [PATCH 5/5] mm: compaction: make isolate_lru_page() filter-aware again Mel Gorman
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
synchronous compaction is never used but this impacts allocation
success rates. When deciding whether to migrate dirty pages, the
following check is made

	if (PageDirty(page) && !sync &&
		mapping->a_ops->migratepage != migrate_page)
			rc = -EBUSY;

This skips over all pages using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It
is the responsibility of the callback to gracefully fail migration of
the page if it cannot be achieved without blocking.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    2 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    4 +-
 include/linux/fs.h      |    9 +++-
 include/linux/migrate.h |    2 +-
 mm/migrate.c            |  106 ++++++++++++++++++++++++++++++++---------------
 6 files changed, 83 insertions(+), 42 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 62afe5c..f841f00 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -872,7 +872,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page)
+			struct page *newpage, struct page *page, bool sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index c1a1bd8..d0c460f 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -328,7 +328,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *);
+		struct page *, struct page *, bool);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1dda78d..33475df 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1711,7 +1711,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page)
+		struct page *page, bool sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
@@ -1726,7 +1726,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 
 	nfs_fscache_release_page(page, GFP_KERNEL);
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0c4df26..67f8e46 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,9 +609,12 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
 						void **, unsigned long *);
-	/* migrate the contents of a page to the specified target */
+	/*
+	 * migrate the contents of a page to the specified target. If sync
+	 * is false, it must not block. If it needs to block, return -EBUSY
+	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2577,7 +2580,7 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *);
+				struct page *, struct page *, bool);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..14e6d2a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			bool sync);
diff --git a/mm/migrate.c b/mm/migrate.c
index 578e291..8395697 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -415,7 +415,7 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	int rc;
 
@@ -432,19 +432,60 @@ int migrate_page(struct address_space *mapping,
 EXPORT_SYMBOL(migrate_page);
 
 #ifdef CONFIG_BLOCK
+
+/* Returns true if all buffers are successfully locked */
+bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+	struct buffer_head *bh = head;
+
+	/* Simple case, sync compaction */
+	if (sync) {
+		do {
+			get_bh(bh);
+			lock_buffer(bh);
+			bh = bh->b_this_page;
+
+		} while (bh != head);
+
+		return true;
+	}
+
+	/* async case, we cannot block on lock_buffer so use trylock_buffer */
+	do {
+		get_bh(bh);
+		if (!trylock_buffer(bh)) {
+			/*
+			 * We failed to lock the buffer and cannot stall in
+			 * async migration. Release the taken locks
+			 */
+			struct buffer_head *failed_bh = bh;
+			bh = head;
+			do {
+				unlock_buffer(bh);
+				put_bh(bh);
+				bh = bh->b_this_page;
+			} while (bh != failed_bh);
+			return false;
+		}
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return true;
+}
+
 /*
  * Migration function for pages with buffers. This function can only be used
  * if the underlying filesystem guarantees that no other references to "page"
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page);
+		return migrate_page(mapping, newpage, page, sync);
 
 	head = page_buffers(page);
 
@@ -453,13 +494,18 @@ int buffer_migrate_page(struct address_space *mapping,
 	if (rc)
 		return rc;
 
-	bh = head;
-	do {
-		get_bh(bh);
-		lock_buffer(bh);
-		bh = bh->b_this_page;
-
-	} while (bh != head);
+	if (!buffer_migrate_lock_buffers(head, sync)) {
+		/*
+		 * We have to revert the radix tree update. If this returns
+		 * non-zero, it either means that the page count changed
+		 * which "can't happen" or the slot changed from underneath
+		 * us in which case someone operated on a page that did not
+		 * have buffers fully migrated which is alarming so warn
+		 * that it happened.
+		 */
+		WARN_ON(migrate_page_move_mapping(mapping, page, newpage));
+		return -EBUSY;
+	}
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -536,10 +582,13 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page)
+	struct page *newpage, struct page *page, bool sync)
 {
-	if (PageDirty(page))
+	if (PageDirty(page)) {
+		if (!sync)
+			return -EBUSY;
 		return writeout(mapping, page);
+	}
 
 	/*
 	 * Buffers may be managed in a filesystem specific way.
@@ -549,7 +598,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 
 /*
@@ -585,29 +634,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page);
-	else {
+		rc = migrate_page(mapping, newpage, page, sync);
+	else if (mapping->a_ops->migratepage)
 		/*
-		 * Do not writeback pages if !sync and migratepage is
-		 * not pointing to migrate_page() which is nonblocking
-		 * (swapcache/tmpfs uses migratepage = migrate_page).
+		 * Most pages have a mapping and most filesystems provide a
+		 * migratepage callback. Anonymous pages are part of swap
+		 * space which also has its own migratepage callback. This
+		 * is the most common path for page migration.
 		 */
-		if (PageDirty(page) && !sync &&
-		    mapping->a_ops->migratepage != migrate_page)
-			rc = -EBUSY;
-		else if (mapping->a_ops->migratepage)
-			/*
-			 * Most pages have a mapping and most filesystems
-			 * should provide a migration function. Anonymous
-			 * pages are part of swap space which also has its
-			 * own migration function. This is the most common
-			 * path for page migration.
-			 */
-			rc = mapping->a_ops->migratepage(mapping,
-							newpage, page);
-		else
-			rc = fallback_migrate_page(mapping, newpage, page);
-	}
+		rc = mapping->a_ops->migratepage(mapping,
+						newpage, page, sync);
+	else
+		rc = fallback_migrate_page(mapping, newpage, page, sync);
 
 	if (rc) {
 		newpage->mapping = NULL;
-- 
1.7.3.4



* [PATCH 5/5] mm: compaction: make isolate_lru_page() filter-aware again
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (3 preceding siblings ...)
  2011-11-18 16:58 ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
@ 2011-11-18 16:58 ` Mel Gorman
  2011-11-19 19:54 ` [RFC PATCH 0/5] Reduce compaction-related stalls Andrea Arcangeli
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Linux-MM
  Cc: Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, Mel Gorman, LKML

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that it was meaningless to pick the page and re-add it to the LRU list.
This had to be partially reverted because some dirty pages can be
migrated by compaction without blocking.

This patch updates "mm: compaction: make isolate_lru_page" by skipping
over pages that compaction has no possibility of migrating, to minimise
LRU disruption.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 include/linux/mmzone.h |    2 ++
 mm/compaction.c        |    3 +++
 mm/vmscan.c            |   36 ++++++++++++++++++++++++++++++++++--
 3 files changed, 39 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 188cb2f..ac5b522 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -173,6 +173,8 @@ static inline int is_unevictable_lru(enum lru_list l)
 #define ISOLATE_CLEAN		((__force isolate_mode_t)0x4)
 /* Isolate unmapped file */
 #define ISOLATE_UNMAPPED	((__force isolate_mode_t)0x8)
+/* Isolate for asynchronous migration */
+#define ISOLATE_ASYNC_MIGRATE	((__force isolate_mode_t)0x10)
 
 /* LRU Isolation modes. */
 typedef unsigned __bitwise__ isolate_mode_t;
diff --git a/mm/compaction.c b/mm/compaction.c
index 615502b..0379263 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -349,6 +349,9 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
+		if (!cc->sync)
+			mode |= ISOLATE_ASYNC_MIGRATE;
+
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1893c0..8350186 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1061,8 +1061,40 @@ int __isolate_lru_page(struct page *page, isolate_mode_t mode, int file)
 
 	ret = -EBUSY;
 
-	if ((mode & ISOLATE_CLEAN) && (PageDirty(page) || PageWriteback(page)))
-		return ret;
+	/*
+	 * To minimise LRU disruption, the caller can indicate that it only
+	 * wants to isolate pages it will be able to operate on without
+	 * blocking - clean pages for the most part.
+	 *
+	 * ISOLATE_CLEAN means that only clean pages should be isolated. This
+	 * is used by reclaim when it cannot write to backing storage
+	 *
+	 * ISOLATE_ASYNC_MIGRATE is used to indicate that it only wants pages
+	 * that it is possible to migrate without blocking with a ->migratepage
+	 * handler
+	 */
+	if (mode & (ISOLATE_CLEAN|ISOLATE_ASYNC_MIGRATE)) {
+		/* All the caller can do on PageWriteback is block */
+		if (PageWriteback(page))
+			return ret;
+
+		if (PageDirty(page)) {
+			struct address_space *mapping;
+
+			/* ISOLATE_CLEAN means only clean pages */
+			if (mode & ISOLATE_CLEAN)
+				return ret;
+
+			/*
+			 * Only the ->migratepage callback knows if a dirty
+			 * page can be migrated without blocking. Skip the
+			 * page unless there is a ->migratepage callback.
+			 */
+			mapping = page_mapping(page);
+			if (!mapping || !mapping->a_ops->migratepage)
+				return ret;
+		}
+	}
 
 	if ((mode & ISOLATE_UNMAPPED) && page_mapped(page))
 		return ret;
-- 
1.7.3.4



* Re: [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
  2011-11-18 16:58 ` [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
@ 2011-11-18 17:27   ` Andrea Arcangeli
  2011-11-21 21:46   ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-18 17:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Fri, Nov 18, 2011 at 04:58:41PM +0000, Mel Gorman wrote:
> When asynchronous compaction was introduced, the
> /proc/sys/vm/compact_memory handler should have been updated to always
> use synchronous compaction. This did not happen so this patch addresses
> it. The assumption is if a user writes to /proc/sys/vm/compact_memory,
> they are willing for that process to stall.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/compaction.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 237560e..615502b 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -666,6 +666,7 @@ static int compact_node(int nid)
>  			.nr_freepages = 0,
>  			.nr_migratepages = 0,
>  			.order = -1,
> +			.sync = true,
>  		};
>  
>  		zone = &pgdat->node_zones[zoneid];

Yep I noticed that yesterday too.

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>


* Re: [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages
  2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
@ 2011-11-18 17:28   ` Andrea Arcangeli
  2011-11-21 17:16   ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-18 17:28 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Fri, Nov 18, 2011 at 04:58:40PM +0000, Mel Gorman wrote:
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that it was meaningless to pick the page and re-add it to the LRU list.
> 
> What was missed during review is that asynchronous migration moves
> dirty pages if their ->migratepage callback is migrate_page() because
> these can be moved without blocking. This potentially impacted
> hugepage allocation success rates by a factor depending on how many
> dirty pages are in the system.
> 
> This patch partially reverts 39deaf85 to allow migration to isolate
> dirty pages again. This increases how much compaction disrupts the
> LRU but that is addressed later in the series.
> 
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  mm/compaction.c |    3 ---
>  1 files changed, 0 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 899d956..237560e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -349,9 +349,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  			continue;
>  		}
>  
> -		if (!cc->sync)
> -			mode |= ISOLATE_CLEAN;
> -
>  		/* Try isolate the page */
>  		if (__isolate_lru_page(page, mode, 0) != 0)
>  			continue;

Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>


* Re: [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations
  2011-11-18 16:58 ` [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
@ 2011-11-18 17:34   ` Andrea Arcangeli
  0 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-18 17:34 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Fri, Nov 18, 2011 at 04:58:42PM +0000, Mel Gorman wrote:
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 3a76faf..ef1b1af 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -83,7 +83,18 @@ struct vm_area_struct;
>  #define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE) /* Page is reclaimable */
>  #define __GFP_NOTRACK	((__force gfp_t)___GFP_NOTRACK)  /* Don't track with kmemcheck */
>  
> +/*
> + * __GFP_NO_KSWAPD indicates that the VM should favour failing the allocation
> + * over excessive disruption of the system. Currently this means
> + * 1. Do not wake kswapd (hence the flag name)
> + * 2. Do not use synchronous compaction for high-order allocations
> + *    as this may cause the caller to stall writing out pages
> + *
> + * This flag is primarily intended for use with transparent hugepage support.
> + * If the flag is used outside the VM, linux-mm should be cc'd for review.
> + */
>  #define __GFP_NO_KSWAPD	((__force gfp_t)___GFP_NO_KSWAPD)
> +
>  #define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE) /* On behalf of other node */
>  
>  /*
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..7a5c5b1 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2168,7 +2168,14 @@ rebalance:
>  					sync_migration);
>  	if (page)
>  		goto got_pg;
> -	sync_migration = true;
> +
> +	/*
> +	 * Do not use sync migration if __GFP_NO_KSWAPD is used to indicate
> +	 * the system should not be heavily disrupted. In practice, this is
> +	 * to avoid THP callers being stalled in writeback during migration
> +	 * as it's preferable for the allocations to fail than to stall
> +	 */
> +	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);
>  
>  	/* Try direct reclaim and then allocating */
>  	page = __alloc_pages_direct_reclaim(gfp_mask, order,

I don't like this anymore. Also with sync migration compaction will
only scan the movable pageblocks.

What I'm running right now is this:

From: Andrea Arcangeli <aarcange@redhat.com>
Subject: [PATCH] compaction: avoid overwork in migrate sync mode

Add a lightweight sync migration (sync == 2) mode that avoids overwork,
making it more suitable for use by compaction: lower latency while still
providing write throttling.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/migrate.h |    8 ++--
 mm/compaction.c         |    2 +-
 mm/memory-failure.c     |    2 +-
 mm/memory_hotplug.c     |    2 +-
 mm/mempolicy.c          |    4 +-
 mm/migrate.c            |   95 +++++++++++++++++++++++++++++++----------------
 6 files changed, 72 insertions(+), 41 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..f26fc0e 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -14,10 +14,10 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			int sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			int sync);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -36,10 +36,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		int sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		int sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..48a3579 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -555,7 +555,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync);
+				cc->sync ? 2 : 0);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..d8a41d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-								0, true);
+				    false, 1);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2168489..e1d6176 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		}
 		/* this function returns # of failed pages */
 		ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
-								true, true);
+				    true, 1);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index adc3954..0bf88ed 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -933,7 +933,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-								false, true);
+				    false, 1);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1154,7 +1154,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, true);
+						false, 1);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 578e291..612c3ba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping,
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-					int remap_swapcache, bool sync)
+			    int remap_swapcache, bool force)
 {
 	struct address_space *mapping;
 	int rc;
@@ -588,11 +588,11 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = migrate_page(mapping, newpage, page);
 	else {
 		/*
-		 * Do not writeback pages if !sync and migratepage is
+		 * Do not writeback pages if !force and migratepage is
 		 * not pointing to migrate_page() which is nonblocking
 		 * (swapcache/tmpfs uses migratepage = migrate_page).
 		 */
-		if (PageDirty(page) && !sync &&
+		if (PageDirty(page) && !force &&
 		    mapping->a_ops->migratepage != migrate_page)
 			rc = -EBUSY;
 		else if (mapping->a_ops->migratepage)
@@ -622,7 +622,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, bool offlining, bool sync)
+			    bool force, bool offlining)
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
@@ -631,7 +631,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || !sync)
+		if (!force)
 			goto out;
 
 		/*
@@ -676,14 +676,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		/*
-		 * For !sync, there is no point retrying as the retry loop
-		 * is expected to be too short for PageWriteback to be cleared
-		 */
-		if (!sync) {
-			rc = -EBUSY;
-			goto uncharge;
-		}
 		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
@@ -751,7 +743,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+		rc = move_to_new_page(newpage, page, remap_swapcache, force);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -774,7 +766,7 @@ out:
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+			  struct page *page, bool force, bool offlining)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -792,7 +784,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		if (unlikely(split_huge_page(page)))
 			goto out;
 
-	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+	rc = __unmap_and_move(page, newpage, force, offlining);
 out:
 	if (rc != -EAGAIN) {
 		/*
@@ -840,7 +832,7 @@ out:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, bool offlining, bool sync)
+				bool force, bool offlining)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -853,7 +845,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force || !sync)
+		if (!force)
 			goto out;
 		lock_page(hpage);
 	}
@@ -864,7 +856,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1, sync);
+		rc = move_to_new_page(new_hpage, hpage, 1, force);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
@@ -907,11 +899,11 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		int sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
-	int pass = 0;
+	int pass, passes;
 	struct page *page;
 	struct page *page2;
 	int swapwrite = current->flags & PF_SWAPWRITE;
@@ -920,15 +912,34 @@ int migrate_pages(struct list_head *from,
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
-	for(pass = 0; pass < 10 && retry; pass++) {
+	switch (sync) {
+	case 0:
+		/* 3 async  0 sync */
+		passes = 3;
+		break;
+	case 1:
+		/* 3 async, 7 sync */
+		passes = 10;
+		break;
+	case 2:
+		/*
+		 * sync = 2 asks for a lightweight synchronous mode:
+		 * 3 async, 2 sync.
+		 */
+		passes = 5;
+		break;
+	default:
+		BUG();
+	}
+
+	for(pass = 0; pass < passes && retry; pass++) {
 		retry = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2, offlining,
-						sync);
+			rc = unmap_and_move(get_new_page, private, page,
+					    pass > 2, offlining);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -958,24 +969,44 @@ out:
 
 int migrate_huge_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		int sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
-	int pass = 0;
+	int pass, passes;
 	struct page *page;
 	struct page *page2;
 	int rc;
 
-	for (pass = 0; pass < 10 && retry; pass++) {
+	switch (sync) {
+	case 0:
+		/* 3 async  0 sync */
+		passes = 3;
+		break;
+	case 1:
+		/* 3 async, 7 sync */
+		passes = 10;
+		break;
+	case 2:
+		/*
+		 * sync = 2 asks for a lightweight synchronous mode:
+		 * 3 async, 2 sync.
+		 */
+		passes = 5;
+		break;
+	default:
+		BUG();
+	}
+
+	for (pass = 0; pass < passes && retry; pass++) {
 		retry = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move_huge_page(get_new_page,
-					private, page, pass > 2, offlining,
-					sync);
+			rc = unmap_and_move_huge_page(get_new_page, private,
+						      page, pass > 2,
+						      offlining);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1104,7 +1135,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, true);
+				    (unsigned long)pm, false, 1);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}


* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-18 16:58 ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
@ 2011-11-18 21:35   ` Andrea Arcangeli
  2011-11-21 11:17     ` Mel Gorman
  2011-11-19  8:59   ` Nai Xia
  1 sibling, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-18 21:35 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Fri, Nov 18, 2011 at 04:58:43PM +0000, Mel Gorman wrote:
> +	/* async case, we cannot block on lock_buffer so use trylock_buffer */
> +	do {
> +		get_bh(bh);
> +		if (!trylock_buffer(bh)) {
> +			/*
> +			 * We failed to lock the buffer and cannot stall in
> +			 * async migration. Release the taken locks
> +			 */
> +			struct buffer_head *failed_bh = bh;
> +			bh = head;
> +			do {
> +				unlock_buffer(bh);
> +				put_bh(bh);
> +				bh = bh->b_this_page;
> +			} while (bh != failed_bh);
> +			return false;

Here if blocksize is < PAGE_SIZE you're leaking one get_bh
(memleak). If blocksize is PAGE_SIZE (common) you're unlocking a bh
that somebody else holds locked, leading to fs corruption.
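
Presumably the backout needs to look something like this instead (only a
sketch of the fix being described here, not tested): the buffer that failed
trylock_buffer() holds a reference but no lock, so it should only be
put_bh()'d, while the buffers locked earlier get the unlock + put:

		if (!trylock_buffer(bh)) {
			struct buffer_head *failed_bh = bh;

			/* we never locked this one, just drop the ref */
			put_bh(failed_bh);
			bh = head;
			while (bh != failed_bh) {
				unlock_buffer(bh);
				put_bh(bh);
				bh = bh->b_this_page;
			}
			return false;
		}
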
> +	if (!buffer_migrate_lock_buffers(head, sync)) {
> +		/*
> +		 * We have to revert the radix tree update. If this returns
> +		 * non-zero, it either means that the page count changed
> +		 * which "can't happen" or the slot changed from underneath
> +		 * us in which case someone operated on a page that did not
> +		 * have buffers fully migrated which is alarming so warn
> +		 * that it happened.
> +		 */
> +		WARN_ON(migrate_page_move_mapping(mapping, page, newpage));

Speculative pagecache lookups can actually increase the count; the
freezing is released before returning from migrate_page_move_mapping.
It's not alarming that pagecache lookups flip the count all over the
place. The only way to stop them is page_freeze_refs.
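
For reference, the check inside migrate_page_move_mapping() that trips the
WARN_ON looks roughly like this (paraphrased from mm/migrate.c of this era,
not exact):

	spin_lock_irq(&mapping->tree_lock);

	expected_count = 2 + page_has_private(page);
	if (page_count(page) != expected_count ||
	    radix_tree_deref_slot_protected(pslot,
				&mapping->tree_lock) != page) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	if (!page_freeze_refs(page, expected_count)) {
		/* a speculative lookup took a reference in the meantime */
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}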

Folks who want low latency or no memory overhead should simply
disable compaction. In my tests these "lowlatency" changes, notably
the change in vmscan that is already upstream, break thp allocation
reliability. The __GFP_NO_KSWAPD check too should be dropped I think;
it's a good thing we dropped it because the sync migrate is needed,
or the above pages with bh to migrate would become "unmovable" despite
being allocated in "movable" pageblocks.

The workload to test is:

cp /dev/sda /dev/null &
cp /dev/zero /media/someusb/zero &
wait free memory to reach minimum level
./largepage (allocate some gigabyte of hugepages)
grep thp /proc/vmstat
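
For reference, a minimal stand-in for the ./largepage step above (this is an
assumption about what it does, not the actual test program: map a few GB of
anonymous memory, ask for THP and touch it):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14	/* kernel ABI value */
#endif

int main(int argc, char **argv)
{
	/* size in GB from argv[1], default 4GB */
	size_t size = (argc > 1 ? (size_t)atol(argv[1]) : 4) << 30;
	char *p;

	p = mmap(NULL, size, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ask explicitly for transparent hugepages on this range */
	madvise(p, size, MADV_HUGEPAGE);

	/* touch every page so the THP allocations actually happen */
	memset(p, 1, size);

	/* hold the mappings while /proc/vmstat is inspected */
	pause();
	return 0;
}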

Anything that leads to a thp allocation failure rate of 50% in this
workload should be banned, and all compaction patches (including vmscan
changes) should go through the above workload.

I got back to the previous state and there's <10% failure even in
the above workload (and close to 100% success in normal load, but it's
harder to define normal load while the above is pretty easy to define).


* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-18 16:58 ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
  2011-11-18 21:35   ` Andrea Arcangeli
@ 2011-11-19  8:59   ` Nai Xia
  2011-11-19  9:48     ` Nai Xia
  2011-11-21 11:19     ` Mel Gorman
  1 sibling, 2 replies; 40+ messages in thread
From: Nai Xia @ 2011-11-19  8:59 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On Sat, Nov 19, 2011 at 12:58 AM, Mel Gorman <mgorman@suse.de> wrote:
> Asynchronous compaction is used when allocating transparent hugepages to
> avoid blocking for long periods of time. Due to reports of stalling,
> synchronous compaction is never used but this impacts allocation
> success rates. When deciding whether to migrate dirty pages, the
> following check is made
>
>        if (PageDirty(page) && !sync &&
>                mapping->a_ops->migratepage != migrate_page)
>                        rc = -EBUSY;
>
> This skips over all pages using buffer_migrate_page() even though
> it is possible to migrate some of these pages without blocking. This
> patch updates the ->migratepage callback with a "sync" parameter. It
> is the responsibility of the callback to gracefully fail migration of
> the page if it cannot be achieved without blocking.
>
> Signed-off-by: Mel Gorman <mgorman@suse.de>
> ---
>  fs/btrfs/disk-io.c      |    2 +-
>  fs/nfs/internal.h       |    2 +-
>  fs/nfs/write.c          |    4 +-
>  include/linux/fs.h      |    9 +++-
>  include/linux/migrate.h |    2 +-
>  mm/migrate.c            |  106 ++++++++++++++++++++++++++++++++---------------
>  6 files changed, 83 insertions(+), 42 deletions(-)
>
> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
> index 62afe5c..f841f00 100644
> --- a/fs/btrfs/disk-io.c
> +++ b/fs/btrfs/disk-io.c
> @@ -872,7 +872,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
>
>  #ifdef CONFIG_MIGRATION
>  static int btree_migratepage(struct address_space *mapping,
> -                       struct page *newpage, struct page *page)
> +                       struct page *newpage, struct page *page, bool sync)
>  {
>        /*
>         * we can't safely write a btree page from here,
> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
> index c1a1bd8..d0c460f 100644
> --- a/fs/nfs/internal.h
> +++ b/fs/nfs/internal.h
> @@ -328,7 +328,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
>
>  #ifdef CONFIG_MIGRATION
>  extern int nfs_migrate_page(struct address_space *,
> -               struct page *, struct page *);
> +               struct page *, struct page *, bool);
>  #else
>  #define nfs_migrate_page NULL
>  #endif
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index 1dda78d..33475df 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -1711,7 +1711,7 @@ out_error:
>
>  #ifdef CONFIG_MIGRATION
>  int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
> -               struct page *page)
> +               struct page *page, bool sync)
>  {
>        /*
>         * If PagePrivate is set, then the page is currently associated with
> @@ -1726,7 +1726,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
>
>        nfs_fscache_release_page(page, GFP_KERNEL);
>
> -       return migrate_page(mapping, newpage, page);
> +       return migrate_page(mapping, newpage, page, sync);
>  }
>  #endif
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 0c4df26..67f8e46 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -609,9 +609,12 @@ struct address_space_operations {
>                        loff_t offset, unsigned long nr_segs);
>        int (*get_xip_mem)(struct address_space *, pgoff_t, int,
>                                                void **, unsigned long *);
> -       /* migrate the contents of a page to the specified target */
> +       /*
> +        * migrate the contents of a page to the specified target. If sync
> +        * is false, it must not block. If it needs to block, return -EBUSY
> +        */
>        int (*migratepage) (struct address_space *,
> -                       struct page *, struct page *);
> +                       struct page *, struct page *, bool);
>        int (*launder_page) (struct page *);
>        int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
>                                        unsigned long);
> @@ -2577,7 +2580,7 @@ extern int generic_check_addressable(unsigned, u64);
>
>  #ifdef CONFIG_MIGRATION
>  extern int buffer_migrate_page(struct address_space *,
> -                               struct page *, struct page *);
> +                               struct page *, struct page *, bool);
>  #else
>  #define buffer_migrate_page NULL
>  #endif
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index e39aeec..14e6d2a 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>
>  extern void putback_lru_pages(struct list_head *l);
>  extern int migrate_page(struct address_space *,
> -                       struct page *, struct page *);
> +                       struct page *, struct page *, bool);
>  extern int migrate_pages(struct list_head *l, new_page_t x,
>                        unsigned long private, bool offlining,
>                        bool sync);
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 578e291..8395697 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -415,7 +415,7 @@ EXPORT_SYMBOL(fail_migrate_page);
>  * Pages are locked upon entry and exit.
>  */
>  int migrate_page(struct address_space *mapping,
> -               struct page *newpage, struct page *page)
> +               struct page *newpage, struct page *page, bool sync)
>  {
>        int rc;
>
> @@ -432,19 +432,60 @@ int migrate_page(struct address_space *mapping,
>  EXPORT_SYMBOL(migrate_page);
>
>  #ifdef CONFIG_BLOCK
> +
> +/* Returns true if all buffers are successfully locked */
> +bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
> +{
> +       struct buffer_head *bh = head;
> +
> +       /* Simple case, sync compaction */
> +       if (sync) {
> +               do {
> +                       get_bh(bh);
> +                       lock_buffer(bh);
> +                       bh = bh->b_this_page;
> +
> +               } while (bh != head);
> +
> +               return true;
> +       }
> +
> +       /* async case, we cannot block on lock_buffer so use trylock_buffer */
> +       do {
> +               get_bh(bh);
> +               if (!trylock_buffer(bh)) {
> +                       /*
> +                        * We failed to lock the buffer and cannot stall in
> +                        * async migration. Release the taken locks
> +                        */
> +                       struct buffer_head *failed_bh = bh;
> +                       bh = head;
> +                       do {
> +                               unlock_buffer(bh);
> +                               put_bh(bh);
> +                               bh = bh->b_this_page;
> +                       } while (bh != failed_bh);
> +                       return false;
> +               }
> +
> +               bh = bh->b_this_page;
> +       } while (bh != head);
> +       return true;
> +}
> +
>  /*
>  * Migration function for pages with buffers. This function can only be used
>  * if the underlying filesystem guarantees that no other references to "page"
>  * exist.
>  */
>  int buffer_migrate_page(struct address_space *mapping,
> -               struct page *newpage, struct page *page)
> +               struct page *newpage, struct page *page, bool sync)
>  {
>        struct buffer_head *bh, *head;
>        int rc;
>
>        if (!page_has_buffers(page))
> -               return migrate_page(mapping, newpage, page);
> +               return migrate_page(mapping, newpage, page, sync);
>
>        head = page_buffers(page);
>
> @@ -453,13 +494,18 @@ int buffer_migrate_page(struct address_space *mapping,
>        if (rc)
>                return rc;
>
> -       bh = head;
> -       do {
> -               get_bh(bh);
> -               lock_buffer(bh);
> -               bh = bh->b_this_page;
> -
> -       } while (bh != head);
> +       if (!buffer_migrate_lock_buffers(head, sync)) {
> +               /*
> +                * We have to revert the radix tree update. If this returns
> +                * non-zero, it either means that the page count changed
> +                * which "can't happen" or the slot changed from underneath
> +                * us in which case someone operated on a page that did not
> +                * have buffers fully migrated which is alarming so warn
> +                * that it happened.
> +                */
> +               WARN_ON(migrate_page_move_mapping(mapping, page, newpage));
> +               return -EBUSY;

If this migrate_page_move_mapping() really fails, it seems disk IO will be
needed to bring the previously cached page back. I wonder if we should
double check the two conditions, "page refs are ok" and "all bh trylocked",
before doing radix_tree_replace_slot()? That way, I think, the error path
does not involve IO.


Nai
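
For illustration, a minimal sketch of the ordering being suggested, i.e.
only touch the radix tree slot once both the page count check and the
buffer trylocks have succeeded, so the error path never needs IO. This is
roughly what the updated patch later in the thread does inside
migrate_page_move_mapping(), using 3.2-era names; treat it as a sketch,
not the actual patch:

	spin_lock_irq(&mapping->tree_lock);
	pslot = radix_tree_lookup_slot(&mapping->page_tree, page_index(page));

	expected_count = 2 + page_has_private(page);
	if (page_count(page) != expected_count ||
	    radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	if (!page_freeze_refs(page, expected_count)) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	/* Async case: trylock the buffers while the refs are frozen */
	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
		page_unfreeze_refs(page, expected_count);
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

	/* ... only now replace the radix tree slot and fix up counts ... */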

> +       }
>
>        ClearPagePrivate(page);
>        set_page_private(newpage, page_private(page));
> @@ -536,10 +582,13 @@ static int writeout(struct address_space *mapping, struct page *page)
>  * Default handling if a filesystem does not provide a migration function.
>  */
>  static int fallback_migrate_page(struct address_space *mapping,
> -       struct page *newpage, struct page *page)
> +       struct page *newpage, struct page *page, bool sync)
>  {
> -       if (PageDirty(page))
> +       if (PageDirty(page)) {
> +               if (!sync)
> +                       return -EBUSY;
>                return writeout(mapping, page);
> +       }
>
>        /*
>         * Buffers may be managed in a filesystem specific way.
> @@ -549,7 +598,7 @@ static int fallback_migrate_page(struct address_space *mapping,
>            !try_to_release_page(page, GFP_KERNEL))
>                return -EAGAIN;
>
> -       return migrate_page(mapping, newpage, page);
> +       return migrate_page(mapping, newpage, page, sync);
>  }
>
>  /*
> @@ -585,29 +634,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
>
>        mapping = page_mapping(page);
>        if (!mapping)
> -               rc = migrate_page(mapping, newpage, page);
> -       else {
> +               rc = migrate_page(mapping, newpage, page, sync);
> +       else if (mapping->a_ops->migratepage)
>                /*
> -                * Do not writeback pages if !sync and migratepage is
> -                * not pointing to migrate_page() which is nonblocking
> -                * (swapcache/tmpfs uses migratepage = migrate_page).
> +                * Most pages have a mapping and most filesystems provide a
> +                * migratepage callback. Anonymous pages are part of swap
> +                * space which also has its own migratepage callback. This
> +                * is the most common path for page migration.
>                 */
> -               if (PageDirty(page) && !sync &&
> -                   mapping->a_ops->migratepage != migrate_page)
> -                       rc = -EBUSY;
> -               else if (mapping->a_ops->migratepage)
> -                       /*
> -                        * Most pages have a mapping and most filesystems
> -                        * should provide a migration function. Anonymous
> -                        * pages are part of swap space which also has its
> -                        * own migration function. This is the most common
> -                        * path for page migration.
> -                        */
> -                       rc = mapping->a_ops->migratepage(mapping,
> -                                                       newpage, page);
> -               else
> -                       rc = fallback_migrate_page(mapping, newpage, page);
> -       }
> +               rc = mapping->a_ops->migratepage(mapping,
> +                                               newpage, page, sync);
> +       else
> +               rc = fallback_migrate_page(mapping, newpage, page, sync);
>
>        if (rc) {
>                newpage->mapping = NULL;
> --
> 1.7.3.4
>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-19  8:59   ` Nai Xia
@ 2011-11-19  9:48     ` Nai Xia
  2011-11-21 11:19     ` Mel Gorman
  1 sibling, 0 replies; 40+ messages in thread
From: Nai Xia @ 2011-11-19  9:48 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On Sat, Nov 19, 2011 at 4:59 PM, Nai Xia <nai.xia@gmail.com> wrote:
> On Sat, Nov 19, 2011 at 12:58 AM, Mel Gorman <mgorman@suse.de> wrote:
>> Asynchronous compaction is used when allocating transparent hugepages to
>> avoid blocking for long periods of time. Due to reports of stalling,
>> synchronous compaction is never used but this impacts allocation
>> success rates. When deciding whether to migrate dirty pages, the
>> following check is made
>>
>>        if (PageDirty(page) && !sync &&
>>                mapping->a_ops->migratepage != migrate_page)
>>                        rc = -EBUSY;
>>
>> This skips over all pages using buffer_migrate_page() even though
>> it is possible to migrate some of these pages without blocking. This
>> patch updates the ->migratepage callback with a "sync" parameter. It
>> is the responsibility of the callback to gracefully fail migration of
>> the page if it cannot be achieved without blocking.
>>
>> Signed-off-by: Mel Gorman <mgorman@suse.de>
>> ---
>>  fs/btrfs/disk-io.c      |    2 +-
>>  fs/nfs/internal.h       |    2 +-
>>  fs/nfs/write.c          |    4 +-
>>  include/linux/fs.h      |    9 +++-
>>  include/linux/migrate.h |    2 +-
>>  mm/migrate.c            |  106 ++++++++++++++++++++++++++++++++---------------
>>  6 files changed, 83 insertions(+), 42 deletions(-)
>>
>> diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
>> index 62afe5c..f841f00 100644
>> --- a/fs/btrfs/disk-io.c
>> +++ b/fs/btrfs/disk-io.c
>> @@ -872,7 +872,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
>>
>>  #ifdef CONFIG_MIGRATION
>>  static int btree_migratepage(struct address_space *mapping,
>> -                       struct page *newpage, struct page *page)
>> +                       struct page *newpage, struct page *page, bool sync)
>>  {
>>        /*
>>         * we can't safely write a btree page from here,
>> diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
>> index c1a1bd8..d0c460f 100644
>> --- a/fs/nfs/internal.h
>> +++ b/fs/nfs/internal.h
>> @@ -328,7 +328,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
>>
>>  #ifdef CONFIG_MIGRATION
>>  extern int nfs_migrate_page(struct address_space *,
>> -               struct page *, struct page *);
>> +               struct page *, struct page *, bool);
>>  #else
>>  #define nfs_migrate_page NULL
>>  #endif
>> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
>> index 1dda78d..33475df 100644
>> --- a/fs/nfs/write.c
>> +++ b/fs/nfs/write.c
>> @@ -1711,7 +1711,7 @@ out_error:
>>
>>  #ifdef CONFIG_MIGRATION
>>  int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
>> -               struct page *page)
>> +               struct page *page, bool sync)
>>  {
>>        /*
>>         * If PagePrivate is set, then the page is currently associated with
>> @@ -1726,7 +1726,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
>>
>>        nfs_fscache_release_page(page, GFP_KERNEL);
>>
>> -       return migrate_page(mapping, newpage, page);
>> +       return migrate_page(mapping, newpage, page, sync);
>>  }
>>  #endif
>>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 0c4df26..67f8e46 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -609,9 +609,12 @@ struct address_space_operations {
>>                        loff_t offset, unsigned long nr_segs);
>>        int (*get_xip_mem)(struct address_space *, pgoff_t, int,
>>                                                void **, unsigned long *);
>> -       /* migrate the contents of a page to the specified target */
>> +       /*
>> +        * migrate the contents of a page to the specified target. If sync
>> +        * is false, it must not block. If it needs to block, return -EBUSY
>> +        */
>>        int (*migratepage) (struct address_space *,
>> -                       struct page *, struct page *);
>> +                       struct page *, struct page *, bool);
>>        int (*launder_page) (struct page *);
>>        int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
>>                                        unsigned long);
>> @@ -2577,7 +2580,7 @@ extern int generic_check_addressable(unsigned, u64);
>>
>>  #ifdef CONFIG_MIGRATION
>>  extern int buffer_migrate_page(struct address_space *,
>> -                               struct page *, struct page *);
>> +                               struct page *, struct page *, bool);
>>  #else
>>  #define buffer_migrate_page NULL
>>  #endif
>> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
>> index e39aeec..14e6d2a 100644
>> --- a/include/linux/migrate.h
>> +++ b/include/linux/migrate.h
>> @@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
>>
>>  extern void putback_lru_pages(struct list_head *l);
>>  extern int migrate_page(struct address_space *,
>> -                       struct page *, struct page *);
>> +                       struct page *, struct page *, bool);
>>  extern int migrate_pages(struct list_head *l, new_page_t x,
>>                        unsigned long private, bool offlining,
>>                        bool sync);
>> diff --git a/mm/migrate.c b/mm/migrate.c
>> index 578e291..8395697 100644
>> --- a/mm/migrate.c
>> +++ b/mm/migrate.c
>> @@ -415,7 +415,7 @@ EXPORT_SYMBOL(fail_migrate_page);
>>  * Pages are locked upon entry and exit.
>>  */
>>  int migrate_page(struct address_space *mapping,
>> -               struct page *newpage, struct page *page)
>> +               struct page *newpage, struct page *page, bool sync)
>>  {
>>        int rc;
>>
>> @@ -432,19 +432,60 @@ int migrate_page(struct address_space *mapping,
>>  EXPORT_SYMBOL(migrate_page);
>>
>>  #ifdef CONFIG_BLOCK
>> +
>> +/* Returns true if all buffers are successfully locked */
>> +bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
>> +{
>> +       struct buffer_head *bh = head;
>> +
>> +       /* Simple case, sync compaction */
>> +       if (sync) {
>> +               do {
>> +                       get_bh(bh);
>> +                       lock_buffer(bh);
>> +                       bh = bh->b_this_page;
>> +
>> +               } while (bh != head);
>> +
>> +               return true;
>> +       }
>> +
>> +       /* async case, we cannot block on lock_buffer so use trylock_buffer */
>> +       do {
>> +               get_bh(bh);
>> +               if (!trylock_buffer(bh)) {
>> +                       /*
>> +                        * We failed to lock the buffer and cannot stall in
>> +                        * async migration. Release the taken locks
>> +                        */
>> +                       struct buffer_head *failed_bh = bh;
>> +                       bh = head;
>> +                       do {
>> +                               unlock_buffer(bh);
>> +                               put_bh(bh);
>> +                               bh = bh->b_this_page;
>> +                       } while (bh != failed_bh);
>> +                       return false;
>> +               }
>> +
>> +               bh = bh->b_this_page;
>> +       } while (bh != head);
>> +       return true;
>> +}
>> +
>>  /*
>>  * Migration function for pages with buffers. This function can only be used
>>  * if the underlying filesystem guarantees that no other references to "page"
>>  * exist.
>>  */
>>  int buffer_migrate_page(struct address_space *mapping,
>> -               struct page *newpage, struct page *page)
>> +               struct page *newpage, struct page *page, bool sync)
>>  {
>>        struct buffer_head *bh, *head;
>>        int rc;
>>
>>        if (!page_has_buffers(page))
>> -               return migrate_page(mapping, newpage, page);
>> +               return migrate_page(mapping, newpage, page, sync);
>>
>>        head = page_buffers(page);
>>
>> @@ -453,13 +494,18 @@ int buffer_migrate_page(struct address_space *mapping,
>>        if (rc)
>>                return rc;
>>
>> -       bh = head;
>> -       do {
>> -               get_bh(bh);
>> -               lock_buffer(bh);
>> -               bh = bh->b_this_page;
>> -
>> -       } while (bh != head);
>> +       if (!buffer_migrate_lock_buffers(head, sync)) {
>> +               /*
>> +                * We have to revert the radix tree update. If this returns
>> +                * non-zero, it either means that the page count changed
>> +                * which "can't happen" or the slot changed from underneath
>> +                * us in which case someone operated on a page that did not
>> +                * have buffers fully migrated which is alarming so warn
>> +                * that it happened.
>> +                */
>> +               WARN_ON(migrate_page_move_mapping(mapping, page, newpage));
>> +               return -EBUSY;
>
> If this migrate_page_move_mapping() really fails, it seems disk IO will be
> needed to bring the previously cached page back in. I wonder if we should

Oh, I mean for clean pages. And for dirty pages, will their content get lost on
this error path?

> double check the two conditions, "page refs are ok" and "all bh trylocked",
> before doing radix_tree_replace_slot()? That way the error path would not
> involve any IO.
>
>
> Nai
>
>> +       }
>>
>>        ClearPagePrivate(page);

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [RFC PATCH 0/5] Reduce compaction-related stalls...
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (4 preceding siblings ...)
  2011-11-18 16:58 ` [PATCH 5/5] mm: compaction: make isolate_lru_page() filter-aware again Mel Gorman
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-19 19:54 ` [PATCH 1/8] mm: compaction: Allow compaction to isolate dirty pages Andrea Arcangeli
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

Hi Mel,

these are the latest compaction/MM related patches I'm using. To me it
looks like the recent changes regressed the reliability of compaction,
and with the series below I seem to get good reliability again.

The __GFP_NO_KSWAPD patch would also definitely regress it by preventing
sync compaction from running, which is fundamental to making the movable
pageblock really movable. This series further improves reliability by
making sure sync compaction is always run (see patch 4/8) and also
reduces the migration overhead (see patch 5/8).

I think it'd be good to test 5/8 and see if it reduces the stalls a
bit with the usb stick writes. It probably won't be enough, but it's
still worth a try and it sounds like a good idea anyway.

The direction of allowing async compaction to migrate all types of
pages (so we don't screw up the movable pageblock) is good, but the
version you posted had blocker bugs, so I think this is safer as a
start and you can hack it on top of this if you want. Until we have
that working really well, the __GFP_NO_KSWAPD patch isn't a good idea
IMHO.

[PATCH 1/8] mm: compaction: Allow compaction to isolate dirty pages
[PATCH 2/8] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
[PATCH 3/8] mm: check if we isolated a compound page during lumpy scan
[PATCH 4/8] mm: compaction: defer compaction only with sync_migration
[PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode
[PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware"
[PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed"
[PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations"

btw, my aa tree on git.kernel.org includes the above.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH 1/8] mm: compaction: Allow compaction to isolate dirty pages
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (5 preceding siblings ...)
  2011-11-19 19:54 ` [RFC PATCH 0/5] Reduce compaction-related stalls Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-19 19:54 ` [PATCH 2/8] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Andrea Arcangeli
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

From: Mel Gorman <mgorman@suse.de>

Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
noted that compaction does not migrate dirty or writeback pages and
that it was meaningless to pick the page and re-add it to the LRU list.

What was missed during review is that asynchronous migration moves
dirty pages if their ->migratepage callback is migrate_page() because
these can be moved without blocking. This potentially impacted
hugepage allocation success rates by a factor depending on how many
dirty pages are in the system.

This patch partially reverts 39deaf85 to allow migration to isolate
dirty pages again. This increases how much compaction disrupts the
LRU but that is addressed later in the series.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c |    3 ---
 1 files changed, 0 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 899d956..237560e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -349,9 +349,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 			continue;
 		}
 
-		if (!cc->sync)
-			mode |= ISOLATE_CLEAN;
-
 		/* Try isolate the page */
 		if (__isolate_lru_page(page, mode, 0) != 0)
 			continue;

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 2/8] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (6 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 1/8] mm: compaction: Allow compaction to isolate dirty pages Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-19 19:54 ` [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan Andrea Arcangeli
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

From: Mel Gorman <mgorman@suse.de>

When asynchronous compaction was introduced, the
/proc/sys/vm/compact_memory handler should have been updated to always
use synchronous compaction. This did not happen so this patch addresses
it. The assumption is if a user writes to /proc/sys/vm/compact_memory,
they are willing for that process to stall.

Signed-off-by: Mel Gorman <mgorman@suse.de>
Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 237560e..615502b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -666,6 +666,7 @@ static int compact_node(int nid)
 			.nr_freepages = 0,
 			.nr_migratepages = 0,
 			.order = -1,
+			.sync = true,
 		};
 
 		zone = &pgdat->node_zones[zoneid];

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (7 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 2/8] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 11:51   ` Mel Gorman
  2011-11-19 19:54 ` [PATCH 4/8] mm: compaction: defer compaction only with sync_migration Andrea Arcangeli
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

Properly take into account if we isolated a compound page during the
lumpy scan in reclaim and break the loop if we've isolated enough.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/vmscan.c |    9 ++++++---
 1 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a1893c0..3421746 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1183,13 +1183,16 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
 				break;
 
 			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
+				unsigned int isolated_pages;
 				list_move(&cursor_page->lru, dst);
 				mem_cgroup_del_lru(cursor_page);
-				nr_taken += hpage_nr_pages(page);
-				nr_lumpy_taken++;
+				isolated_pages = hpage_nr_pages(page);
+				nr_taken += isolated_pages;
+				nr_lumpy_taken += isolated_pages;
 				if (PageDirty(cursor_page))
-					nr_lumpy_dirty++;
+					nr_lumpy_dirty += isolated_pages;
 				scan++;
+				pfn += isolated_pages-1;
 			} else {
 				/*
 				 * Check if the page is freed already.

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 4/8] mm: compaction: defer compaction only with sync_migration
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (8 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 12:36   ` Mel Gorman
  2011-11-19 19:54 ` [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode Andrea Arcangeli
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

Let only sync migration drive the
compaction_deferred()/defer_compaction() logic, so sync migration
isn't prevented from running if async migration fails. Without sync
migration, pages requiring migrate.c:writeout() or a ->migratepage
operation (that isn't migrate_page) can't be migrated, and that has
the effect of polluting the movable pageblock with pages that won't be
migrated by async migration, so it's fundamental to guarantee sync
compaction will be run too before failing.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/page_alloc.c |   50 ++++++++++++++++++++++++++++++--------------------
 1 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9dd443d..2229f7d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1891,7 +1891,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 {
 	struct page *page;
 
-	if (!order || compaction_deferred(preferred_zone))
+	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
@@ -1921,7 +1921,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		 * but not enough to satisfy watermarks.
 		 */
 		count_vm_event(COMPACTFAIL);
-		defer_compaction(preferred_zone);
+		if (sync_migration)
+			defer_compaction(preferred_zone);
 
 		cond_resched();
 	}
@@ -2083,7 +2084,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	int alloc_flags;
 	unsigned long pages_reclaimed = 0;
 	unsigned long did_some_progress;
-	bool sync_migration = false;
+	bool sync_migration = false, defer_compaction;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -2160,15 +2161,20 @@ rebalance:
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
 	 */
-	page = __alloc_pages_direct_compact(gfp_mask, order,
-					zonelist, high_zoneidx,
-					nodemask,
-					alloc_flags, preferred_zone,
-					migratetype, &did_some_progress,
-					sync_migration);
-	if (page)
-		goto got_pg;
-	sync_migration = true;
+	defer_compaction = compaction_deferred(preferred_zone);
+	if (!defer_compaction) {
+		page = __alloc_pages_direct_compact(gfp_mask, order,
+						    zonelist, high_zoneidx,
+						    nodemask,
+						    alloc_flags,
+						    preferred_zone,
+						    migratetype,
+						    &did_some_progress,
+						    sync_migration);
+		if (page)
+			goto got_pg;
+		sync_migration = true;
+	}
 
 	/* Try direct reclaim and then allocating */
 	page = __alloc_pages_direct_reclaim(gfp_mask, order,
@@ -2223,19 +2229,23 @@ rebalance:
 		wait_iff_congested(preferred_zone, BLK_RW_ASYNC, HZ/50);
 		goto rebalance;
 	} else {
-		/*
-		 * High-order allocations do not necessarily loop after
-		 * direct reclaim and reclaim/compaction depends on compaction
-		 * being called after reclaim so call directly if necessary
-		 */
-		page = __alloc_pages_direct_compact(gfp_mask, order,
+		if (!defer_compaction) {
+			/*
+			 * High-order allocations do not necessarily
+			 * loop after direct reclaim and
+			 * reclaim/compaction depends on compaction
+			 * being called after reclaim so call directly
+			 * if necessary
+			 */
+			page = __alloc_pages_direct_compact(gfp_mask, order,
 					zonelist, high_zoneidx,
 					nodemask,
 					alloc_flags, preferred_zone,
 					migratetype, &did_some_progress,
 					sync_migration);
-		if (page)
-			goto got_pg;
+			if (page)
+				goto got_pg;
+		}
 	}
 
 nopage:

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (9 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 4/8] mm: compaction: defer compaction only with sync_migration Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 21:59   ` Rik van Riel
  2011-11-19 19:54 ` [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware" Andrea Arcangeli
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

Add a lightweight sync migration (sync == 2) mode that avoids overwork,
making it more suitable for use by compaction: it provides lower latency
while still throttling writes. It's unclear if this makes a lot of
difference or if we must make async migration more reliable, but looping
10 times for compaction sounds excessive. 3 passes async and 2 passes
sync should be more than enough to get good reliability of migrate when
invoked by sync compaction. Async compaction then runs 3 passes async.

The sync + force flags were mostly overlapping, with almost the same
meaning, so this removes the "sync" flag and keeps the "force" flag in
most migrate functions. Practically the only benefit of the sync flag
was to not retry on writeback pages when invoked through async
migration, but if we retry on locked pages we may as well retry on
writeback pages. If we want to save CPU, it is more worthwhile to retry
only 3 times instead of 10 than to avoid retrying only on writeback
pages (while keeping retrying on pinned and locked pages). Locked pages
are more likely to become unlocked quickly, and so are pinned pages,
but the benefit of not retrying on writeback pages shouldn't be
significant enough to require a special flag.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 include/linux/migrate.h |    8 ++--
 mm/compaction.c         |    2 +-
 mm/memory-failure.c     |    2 +-
 mm/memory_hotplug.c     |    2 +-
 mm/mempolicy.c          |    4 +-
 mm/migrate.c            |   95 +++++++++++++++++++++++++++++++----------------
 6 files changed, 72 insertions(+), 41 deletions(-)

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..f26fc0e 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -14,10 +14,10 @@ extern int migrate_page(struct address_space *,
 			struct page *, struct page *);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			int sync);
 extern int migrate_huge_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
-			bool sync);
+			int sync);
 
 extern int fail_migrate_page(struct address_space *,
 			struct page *, struct page *);
@@ -36,10 +36,10 @@ extern int migrate_huge_page_move_mapping(struct address_space *mapping,
 static inline void putback_lru_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		int sync) { return -ENOSYS; }
 static inline int migrate_huge_pages(struct list_head *l, new_page_t x,
 		unsigned long private, bool offlining,
-		bool sync) { return -ENOSYS; }
+		int sync) { return -ENOSYS; }
 
 static inline int migrate_prep(void) { return -ENOSYS; }
 static inline int migrate_prep_local(void) { return -ENOSYS; }
diff --git a/mm/compaction.c b/mm/compaction.c
index 615502b..9a7fbf5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -552,7 +552,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
 		nr_migrate = cc->nr_migratepages;
 		err = migrate_pages(&cc->migratepages, compaction_alloc,
 				(unsigned long)cc, false,
-				cc->sync);
+				cc->sync ? 2 : 0);
 		update_nr_listpages(cc);
 		nr_remaining = cc->nr_migratepages;
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 06d3479..d8a41d3 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1557,7 +1557,7 @@ int soft_offline_page(struct page *page, int flags)
 					    page_is_file_cache(page));
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, MPOL_MF_MOVE_ALL,
-								0, true);
+				    false, 1);
 		if (ret) {
 			putback_lru_pages(&pagelist);
 			pr_info("soft offline: %#lx: migration failed %d, type %lx\n",
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2168489..e1d6176 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -809,7 +809,7 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		}
 		/* this function returns # of failed pages */
 		ret = migrate_pages(&source, hotremove_migrate_alloc, 0,
-								true, true);
+				    true, 1);
 		if (ret)
 			putback_lru_pages(&source);
 	}
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index adc3954..0bf88ed 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -933,7 +933,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_node_page, dest,
-								false, true);
+				    false, 1);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}
@@ -1154,7 +1154,7 @@ static long do_mbind(unsigned long start, unsigned long len,
 		if (!list_empty(&pagelist)) {
 			nr_failed = migrate_pages(&pagelist, new_vma_page,
 						(unsigned long)vma,
-						false, true);
+						false, 1);
 			if (nr_failed)
 				putback_lru_pages(&pagelist);
 		}
diff --git a/mm/migrate.c b/mm/migrate.c
index 578e291..612c3ba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -564,7 +564,7 @@ static int fallback_migrate_page(struct address_space *mapping,
  *  == 0 - success
  */
 static int move_to_new_page(struct page *newpage, struct page *page,
-					int remap_swapcache, bool sync)
+			    int remap_swapcache, bool force)
 {
 	struct address_space *mapping;
 	int rc;
@@ -588,11 +588,11 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 		rc = migrate_page(mapping, newpage, page);
 	else {
 		/*
-		 * Do not writeback pages if !sync and migratepage is
+		 * Do not writeback pages if !force and migratepage is
 		 * not pointing to migrate_page() which is nonblocking
 		 * (swapcache/tmpfs uses migratepage = migrate_page).
 		 */
-		if (PageDirty(page) && !sync &&
+		if (PageDirty(page) && !force &&
 		    mapping->a_ops->migratepage != migrate_page)
 			rc = -EBUSY;
 		else if (mapping->a_ops->migratepage)
@@ -622,7 +622,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 }
 
 static int __unmap_and_move(struct page *page, struct page *newpage,
-				int force, bool offlining, bool sync)
+			    bool force, bool offlining)
 {
 	int rc = -EAGAIN;
 	int remap_swapcache = 1;
@@ -631,7 +631,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	struct anon_vma *anon_vma = NULL;
 
 	if (!trylock_page(page)) {
-		if (!force || !sync)
+		if (!force)
 			goto out;
 
 		/*
@@ -676,14 +676,6 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 	BUG_ON(charge);
 
 	if (PageWriteback(page)) {
-		/*
-		 * For !sync, there is no point retrying as the retry loop
-		 * is expected to be too short for PageWriteback to be cleared
-		 */
-		if (!sync) {
-			rc = -EBUSY;
-			goto uncharge;
-		}
 		if (!force)
 			goto uncharge;
 		wait_on_page_writeback(page);
@@ -751,7 +743,7 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 
 skip_unmap:
 	if (!page_mapped(page))
-		rc = move_to_new_page(newpage, page, remap_swapcache, sync);
+		rc = move_to_new_page(newpage, page, remap_swapcache, force);
 
 	if (rc && remap_swapcache)
 		remove_migration_ptes(page, page);
@@ -774,7 +766,7 @@ out:
  * to the newly allocated page in newpage.
  */
 static int unmap_and_move(new_page_t get_new_page, unsigned long private,
-			struct page *page, int force, bool offlining, bool sync)
+			  struct page *page, bool force, bool offlining)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -792,7 +784,7 @@ static int unmap_and_move(new_page_t get_new_page, unsigned long private,
 		if (unlikely(split_huge_page(page)))
 			goto out;
 
-	rc = __unmap_and_move(page, newpage, force, offlining, sync);
+	rc = __unmap_and_move(page, newpage, force, offlining);
 out:
 	if (rc != -EAGAIN) {
 		/*
@@ -840,7 +832,7 @@ out:
  */
 static int unmap_and_move_huge_page(new_page_t get_new_page,
 				unsigned long private, struct page *hpage,
-				int force, bool offlining, bool sync)
+				bool force, bool offlining)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -853,7 +845,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	rc = -EAGAIN;
 
 	if (!trylock_page(hpage)) {
-		if (!force || !sync)
+		if (!force)
 			goto out;
 		lock_page(hpage);
 	}
@@ -864,7 +856,7 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	try_to_unmap(hpage, TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS);
 
 	if (!page_mapped(hpage))
-		rc = move_to_new_page(new_hpage, hpage, 1, sync);
+		rc = move_to_new_page(new_hpage, hpage, 1, force);
 
 	if (rc)
 		remove_migration_ptes(hpage, hpage);
@@ -907,11 +899,11 @@ out:
  */
 int migrate_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		int sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
-	int pass = 0;
+	int pass, passes;
 	struct page *page;
 	struct page *page2;
 	int swapwrite = current->flags & PF_SWAPWRITE;
@@ -920,15 +912,34 @@ int migrate_pages(struct list_head *from,
 	if (!swapwrite)
 		current->flags |= PF_SWAPWRITE;
 
-	for(pass = 0; pass < 10 && retry; pass++) {
+	switch (sync) {
+	case 0:
+		/* 3 async  0 sync */
+		passes = 3;
+		break;
+	case 1:
+		/* 3 async, 7 sync */
+		passes = 10;
+		break;
+	case 2:
+		/*
+		 * sync = 2 asks for a lightweight synchronous mode:
+		 * 3 async, 2 sync.
+		 */
+		passes = 5;
+		break;
+	default:
+		BUG();
+	}
+
+	for(pass = 0; pass < passes && retry; pass++) {
 		retry = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move(get_new_page, private,
-						page, pass > 2, offlining,
-						sync);
+			rc = unmap_and_move(get_new_page, private, page,
+					    pass > 2, offlining);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -958,24 +969,44 @@ out:
 
 int migrate_huge_pages(struct list_head *from,
 		new_page_t get_new_page, unsigned long private, bool offlining,
-		bool sync)
+		int sync)
 {
 	int retry = 1;
 	int nr_failed = 0;
-	int pass = 0;
+	int pass, passes;
 	struct page *page;
 	struct page *page2;
 	int rc;
 
-	for (pass = 0; pass < 10 && retry; pass++) {
+	switch (sync) {
+	case 0:
+		/* 3 async  0 sync */
+		passes = 3;
+		break;
+	case 1:
+		/* 3 async, 7 sync */
+		passes = 10;
+		break;
+	case 2:
+		/*
+		 * sync = 2 asks for a lightweight synchronous mode:
+		 * 3 async, 2 sync.
+		 */
+		passes = 5;
+		break;
+	default:
+		BUG();
+	}
+
+	for (pass = 0; pass < passes && retry; pass++) {
 		retry = 0;
 
 		list_for_each_entry_safe(page, page2, from, lru) {
 			cond_resched();
 
-			rc = unmap_and_move_huge_page(get_new_page,
-					private, page, pass > 2, offlining,
-					sync);
+			rc = unmap_and_move_huge_page(get_new_page, private,
+						      page, pass > 2,
+						      offlining);
 
 			switch(rc) {
 			case -ENOMEM:
@@ -1104,7 +1135,7 @@ set_status:
 	err = 0;
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, new_page_node,
-				(unsigned long)pm, 0, true);
+				    (unsigned long)pm, false, 1);
 		if (err)
 			putback_lru_pages(&pagelist);
 	}

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware"
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (10 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 12:57   ` Mel Gorman
  2011-11-19 19:54 ` [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed" Andrea Arcangeli
  2011-11-19 19:54 ` [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations" Andrea Arcangeli
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

This reverts commit
39deaf8585152f1a35c1676d3d7dc6ae0fb65967.

PageDirty is non-blocking for compaction (unlike for
mm/vmscan.c:may_writepage) so async compaction should include dirty pages.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/compaction.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 9a7fbf5..83bf33f 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -261,7 +261,6 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	unsigned long last_pageblock_nr = 0, pageblock_nr;
 	unsigned long nr_scanned = 0, nr_isolated = 0;
 	struct list_head *migratelist = &cc->migratepages;
-	isolate_mode_t mode = ISOLATE_ACTIVE|ISOLATE_INACTIVE;
 
 	/* Do not scan outside zone boundaries */
 	low_pfn = max(cc->migrate_pfn, zone->zone_start_pfn);
@@ -350,7 +349,8 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 		}
 
 		/* Try isolate the page */
-		if (__isolate_lru_page(page, mode, 0) != 0)
+		if (__isolate_lru_page(page,
+				ISOLATE_ACTIVE|ISOLATE_INACTIVE, 0) != 0)
 			continue;
 
 		VM_BUG_ON(PageTransCompound(page));

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed"
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (11 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware" Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 13:09   ` Mel Gorman
  2011-11-19 19:54 ` [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations" Andrea Arcangeli
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

This reverts commit e0c23279c9f800c403f37511484d9014ac83adec.

If reclaim runs with a high order allocation, it means compaction
failed. That means something went wrong with compaction, so we can't
stop reclaim as well. We can't assume it failed and was deferred only
because of the too-low watermarks in compaction_suitable; it may have
failed for other reasons.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/vmscan.c |   32 +++++++++++---------------------
 1 files changed, 11 insertions(+), 21 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3421746..b1a3cb0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2106,19 +2106,14 @@ restart:
  *
  * If a zone is deemed to be full of pinned pages then just give it a light
  * scan then give up on it.
- *
- * This function returns true if a zone is being reclaimed for a costly
- * high-order allocation and compaction is either ready to begin or deferred.
- * This indicates to the caller that it should retry the allocation or fail.
  */
-static bool shrink_zones(int priority, struct zonelist *zonelist,
+static void shrink_zones(int priority, struct zonelist *zonelist,
 					struct scan_control *sc)
 {
 	struct zoneref *z;
 	struct zone *zone;
 	unsigned long nr_soft_reclaimed;
 	unsigned long nr_soft_scanned;
-	bool should_abort_reclaim = false;
 
 	for_each_zone_zonelist_nodemask(zone, z, zonelist,
 					gfp_zone(sc->gfp_mask), sc->nodemask) {
@@ -2135,20 +2130,19 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 				continue;	/* Let kswapd poll it */
 			if (COMPACTION_BUILD) {
 				/*
-				 * If we already have plenty of memory free for
-				 * compaction in this zone, don't free any more.
-				 * Even though compaction is invoked for any
-				 * non-zero order, only frequent costly order
-				 * reclamation is disruptive enough to become a
-				 * noticable problem, like transparent huge page
-				 * allocations.
+				 * If we already have plenty of memory
+				 * free for compaction, don't free any
+				 * more.  Even though compaction is
+				 * invoked for any non-zero order,
+				 * only frequent costly order
+				 * reclamation is disruptive enough to
+				 * become a noticable problem, like
+				 * transparent huge page allocations.
 				 */
 				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
 					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone))) {
-					should_abort_reclaim = true;
+					 compaction_deferred(zone)))
 					continue;
-				}
 			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
@@ -2167,8 +2161,6 @@ static bool shrink_zones(int priority, struct zonelist *zonelist,
 
 		shrink_zone(priority, zone, sc);
 	}
-
-	return should_abort_reclaim;
 }
 
 static bool zone_reclaimable(struct zone *zone)
@@ -2233,9 +2225,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		sc->nr_scanned = 0;
 		if (!priority)
 			disable_swap_token(sc->mem_cgroup);
-		if (shrink_zones(priority, zonelist, sc))
-			break;
-
+		shrink_zones(priority, zonelist, sc);
 		/*
 		 * Don't shrink slabs when reclaiming memory from
 		 * over limit cgroups

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations"
  2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
                   ` (12 preceding siblings ...)
  2011-11-19 19:54 ` [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed" Andrea Arcangeli
@ 2011-11-19 19:54 ` Andrea Arcangeli
  2011-11-21 21:57   ` Rik van Riel
  13 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-19 19:54 UTC (permalink / raw)
  To: linux-mm
  Cc: Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

This reverts commit e0887c19b2daa140f20ca8104bdc5740f39dbb86.

If reclaim runs with a high order allocation, it means compaction
failed. That means something went wrong with compaction, so we can't
stop reclaim as well. We can't assume it failed and was deferred only
because of the too-low watermarks in compaction_suitable; it may have
failed for other reasons.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 mm/vmscan.c |   16 ----------------
 1 files changed, 0 insertions(+), 16 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index b1a3cb0..a9d1ba4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2128,22 +2128,6 @@ static void shrink_zones(int priority, struct zonelist *zonelist,
 				continue;
 			if (zone->all_unreclaimable && priority != DEF_PRIORITY)
 				continue;	/* Let kswapd poll it */
-			if (COMPACTION_BUILD) {
-				/*
-				 * If we already have plenty of memory
-				 * free for compaction, don't free any
-				 * more.  Even though compaction is
-				 * invoked for any non-zero order,
-				 * only frequent costly order
-				 * reclamation is disruptive enough to
-				 * become a noticable problem, like
-				 * transparent huge page allocations.
-				 */
-				if (sc->order > PAGE_ALLOC_COSTLY_ORDER &&
-					(compaction_suitable(zone, sc->order) ||
-					 compaction_deferred(zone)))
-					continue;
-			}
 			/*
 			 * This steals pages from memory cgroups over softlimit
 			 * and returns the number of reclaimed pages and

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-18 21:35   ` Andrea Arcangeli
@ 2011-11-21 11:17     ` Mel Gorman
  2011-11-21 22:45       ` Andrea Arcangeli
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 11:17 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> On Fri, Nov 18, 2011 at 04:58:43PM +0000, Mel Gorman wrote:
> > +	/* async case, we cannot block on lock_buffer so use trylock_buffer */
> > +	do {
> > +		get_bh(bh);
> > +		if (!trylock_buffer(bh)) {
> > +			/*
> > +			 * We failed to lock the buffer and cannot stall in
> > +			 * async migration. Release the taken locks
> > +			 */
> > +			struct buffer_head *failed_bh = bh;
> > +			bh = head;
> > +			do {
> > +				unlock_buffer(bh);
> > +				put_bh(bh);
> > +				bh = bh->b_this_page;
> > +			} while (bh != failed_bh);
> > +			return false;
> 
> here if blocksize is < PAGE_SIZE you're leaking one get_bh
> (memleak). If blocksize is PAGE_SIZE (common) you're unlocking a
> locked bh leading to fs corruption.

Well spotted, that should be easily fixed. Thanks.
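
For reference, this is how the backout loop ends up looking in the
updated patch further down in this mail: the reference taken on the
buffer that failed trylock is dropped first, and only the buffers that
were actually locked before it are unlocked and released, which covers
both the blocksize == PAGE_SIZE and blocksize < PAGE_SIZE cases:

		if (!trylock_buffer(bh)) {
			/*
			 * We failed to lock the buffer and cannot stall
			 * in async migration. Drop the reference taken on
			 * the failed buffer, then release only the
			 * buffers locked before it.
			 */
			struct buffer_head *failed_bh = bh;
			put_bh(failed_bh);
			bh = head;
			while (bh != failed_bh) {
				unlock_buffer(bh);
				put_bh(bh);
				bh = bh->b_this_page;
			}
			return false;
		}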

> > +	if (!buffer_migrate_lock_buffers(head, sync)) {
> > +		/*
> > +		 * We have to revert the radix tree update. If this returns
> > +		 * non-zero, it either means that the page count changed
> > +		 * which "can't happen" or the slot changed from underneath
> > +		 * us in which case someone operated on a page that did not
> > +		 * have buffers fully migrated which is alarming so warn
> > +		 * that it happened.
> > +		 */
> > +		WARN_ON(migrate_page_move_mapping(mapping, page, newpage));
> 
> speculative pagecache lookups can actually increase the count, the
> freezing is released before returning from
> migrate_page_move_mapping. It's not alarming that pagecache lookup
> flips bit all over the place. The only way to stop them is the
> page_freeze_refs.
> 

You're right, speculative pagecache lookup complicates things. If the
backout case encounters a page with an elevated count, there is not
much it can do other than block until that reference has been dropped.
Even then, the backout case would be a bit of a mess.

One alternative option is for migrate_page_move_mapping to trylock
the buffers while the page refs are frozen and before the slot is
updated in the async case, and bail if the buffers cannot be locked.
I am including an updated patch below.

> folks who wants low latency or no memory overhead should simply
> disable compaction.

That strikes me as being somewhat heavy handed. Compaction should be as
low latency as possible.

> In my tests these "lowlatency" changes, notably
> the change in vmscan that is already upstream breaks thp allocation
> reliability,

There might be some confusion on what commits were for. Commit
[e0887c19: vmscan: limit direct reclaim for higher order allocations]
was not about low latency but more about reclaim/compaction reclaiming
too much memory. IIRC, Rik's main problem was that there was too much
memory free on his machine when THP was enabled.

> the __GFP_NO_KSWAPD check too should be dropped I think,

Only if we can get rid of the major stalls. I haven't looked closely at
your series yet but I'll be searching for a replacement for patch 3 of
this series in it.

> it's good thing we dropped it because the sync migrate is needed or
> the above pages with bh to migrate would become "unmovable" despite
> they're allocated in "movable" pageblocks.
> 
> The workload to test is:
> 
> cp /dev/sda /dev/null &
> cp /dev/zero /media/someusb/zero &
> wait free memory to reach minimum level
> ./largepage (allocate some gigabyte of hugepages)
> grep thp /proc/vmstat
> 

Ok. It's not even close to what I was testing but I can move to this
test so we're looking at the same thing for allocation success rates.

> Anything that leads to a thp allocation failure rate of this workload
> of 50% should be banned and all compaction patches (including vmscan
> changes) should go through the above workload.
> 
> I got back to the previous state and there's <10% of failures even in
> the above workload (and close to 100% in normal load but it's harder
> to define normal load while the above is pretty easy to define).

Here is an updated patch that allows more dirty pages to be migrated by
async compaction.
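
To illustrate the contract the new sync parameter creates, a filesystem
callback ends up looking something like this (a hypothetical example
only; foofs_migrate_page is made up and is not part of the patch below):

static int foofs_migrate_page(struct address_space *mapping,
		struct page *newpage, struct page *page, bool sync)
{
	/*
	 * If migrating this page would require blocking (for example
	 * waiting on writeback or buffer locks) and this is async
	 * migration, fail gracefully instead of stalling the caller.
	 */
	if (!sync && PageDirty(page))
		return -EBUSY;

	/* Otherwise fall back to the generic non-blocking copy */
	return migrate_page(mapping, newpage, page, sync);
}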

==== CUT HERE ====
mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage

Asynchronous compaction is used when allocating transparent hugepages to
avoid blocking for long periods of time. Due to reports of stalling,
synchronous compaction is never used but this impacts allocation
success rates. When deciding whether to migrate dirty pages, the
following check is made

	if (PageDirty(page) && !sync &&
		mapping->a_ops->migratepage != migrate_page)
			rc = -EBUSY;

This skips over all pages using buffer_migrate_page() even though
it is possible to migrate some of these pages without blocking. This
patch updates the ->migratepage callback with a "sync" parameter. It
is the responsibility of the callback to gracefully fail migration of
the page if it cannot be achieved without blocking.

Signed-off-by: Mel Gorman <mgorman@suse.de>
---
 fs/btrfs/disk-io.c      |    2 +-
 fs/nfs/internal.h       |    2 +-
 fs/nfs/write.c          |    4 +-
 include/linux/fs.h      |    9 ++-
 include/linux/migrate.h |    2 +-
 mm/migrate.c            |  127 +++++++++++++++++++++++++++++++++--------------
 6 files changed, 101 insertions(+), 45 deletions(-)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 62afe5c..f841f00 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -872,7 +872,7 @@ static int btree_submit_bio_hook(struct inode *inode, int rw, struct bio *bio,
 
 #ifdef CONFIG_MIGRATION
 static int btree_migratepage(struct address_space *mapping,
-			struct page *newpage, struct page *page)
+			struct page *newpage, struct page *page, bool sync)
 {
 	/*
 	 * we can't safely write a btree page from here,
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index c1a1bd8..d0c460f 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -328,7 +328,7 @@ void nfs_commit_release_pages(struct nfs_write_data *data);
 
 #ifdef CONFIG_MIGRATION
 extern int nfs_migrate_page(struct address_space *,
-		struct page *, struct page *);
+		struct page *, struct page *, bool);
 #else
 #define nfs_migrate_page NULL
 #endif
diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 1dda78d..33475df 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -1711,7 +1711,7 @@ out_error:
 
 #ifdef CONFIG_MIGRATION
 int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
-		struct page *page)
+		struct page *page, bool sync)
 {
 	/*
 	 * If PagePrivate is set, then the page is currently associated with
@@ -1726,7 +1726,7 @@ int nfs_migrate_page(struct address_space *mapping, struct page *newpage,
 
 	nfs_fscache_release_page(page, GFP_KERNEL);
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 #endif
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 0c4df26..034cffb 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -609,9 +609,12 @@ struct address_space_operations {
 			loff_t offset, unsigned long nr_segs);
 	int (*get_xip_mem)(struct address_space *, pgoff_t, int,
 						void **, unsigned long *);
-	/* migrate the contents of a page to the specified target */
+	/*
+	 * migrate the contents of a page to the specified target. If sync
+	 * is false, it must not block.
+	 */
 	int (*migratepage) (struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 	int (*launder_page) (struct page *);
 	int (*is_partially_uptodate) (struct page *, read_descriptor_t *,
 					unsigned long);
@@ -2577,7 +2580,7 @@ extern int generic_check_addressable(unsigned, u64);
 
 #ifdef CONFIG_MIGRATION
 extern int buffer_migrate_page(struct address_space *,
-				struct page *, struct page *);
+				struct page *, struct page *, bool);
 #else
 #define buffer_migrate_page NULL
 #endif
diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index e39aeec..14e6d2a 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -11,7 +11,7 @@ typedef struct page *new_page_t(struct page *, unsigned long private, int **);
 
 extern void putback_lru_pages(struct list_head *l);
 extern int migrate_page(struct address_space *,
-			struct page *, struct page *);
+			struct page *, struct page *, bool);
 extern int migrate_pages(struct list_head *l, new_page_t x,
 			unsigned long private, bool offlining,
 			bool sync);
diff --git a/mm/migrate.c b/mm/migrate.c
index 578e291..f93bfad 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -220,6 +220,54 @@ out:
 	pte_unmap_unlock(ptep, ptl);
 }
 
+#ifdef CONFIG_BLOCK
+/* Returns true if all buffers are successfully locked */
+static bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+	struct buffer_head *bh = head;
+
+	/* Simple case, sync compaction */
+	if (sync) {
+		do {
+			get_bh(bh);
+			lock_buffer(bh);
+			bh = bh->b_this_page;
+
+		} while (bh != head);
+
+		return true;
+	}
+
+	/* async case, we cannot block on lock_buffer so use trylock_buffer */
+	do {
+		get_bh(bh);
+		if (!trylock_buffer(bh)) {
+			/*
+			 * We failed to lock the buffer and cannot stall in
+			 * async migration. Release the taken locks
+			 */
+			struct buffer_head *failed_bh = bh;
+			put_bh(failed_bh);
+			bh = head;
+			while (bh != failed_bh) {
+				unlock_buffer(bh);
+				put_bh(bh);
+				bh = bh->b_this_page;
+			}
+			return false;
+		}
+
+		bh = bh->b_this_page;
+	} while (bh != head);
+	return true;
+}
+#else
+static inline bool buffer_migrate_lock_buffers(struct buffer_head *head, bool sync)
+{
+	return true;
+}
+#endif /* CONFIG_BLOCK */
+
 /*
  * Replace the page in the mapping.
  *
@@ -229,7 +277,8 @@ out:
  * 3 for pages with a mapping and PagePrivate/PagePrivate2 set.
  */
 static int migrate_page_move_mapping(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page,
+		struct buffer_head *head, bool sync)
 {
 	int expected_count;
 	void **pslot;
@@ -259,6 +308,19 @@ static int migrate_page_move_mapping(struct address_space *mapping,
 	}
 
 	/*
+	 * In the async migration case of moving a page with buffers, lock the
+	 * buffers using trylock before the mapping is moved. If the mapping
+	 * were moved and we later failed to lock the buffers, we could not
+	 * move the mapping back due to an elevated page count and would have
+	 * to block waiting on other references to be dropped.
+	 */
+	if (!sync && head && !buffer_migrate_lock_buffers(head, sync)) {
+		page_unfreeze_refs(page, expected_count);
+		spin_unlock_irq(&mapping->tree_lock);
+		return -EAGAIN;
+	}
+
+	/*
 	 * Now we know that no one else is looking at the page.
 	 */
 	get_page(newpage);	/* add cache reference */
@@ -415,13 +477,13 @@ EXPORT_SYMBOL(fail_migrate_page);
  * Pages are locked upon entry and exit.
  */
 int migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	int rc;
 
 	BUG_ON(PageWriteback(page));	/* Writeback must be complete */
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, NULL, sync);
 
 	if (rc)
 		return rc;
@@ -438,28 +500,27 @@ EXPORT_SYMBOL(migrate_page);
  * exist.
  */
 int buffer_migrate_page(struct address_space *mapping,
-		struct page *newpage, struct page *page)
+		struct page *newpage, struct page *page, bool sync)
 {
 	struct buffer_head *bh, *head;
 	int rc;
 
 	if (!page_has_buffers(page))
-		return migrate_page(mapping, newpage, page);
+		return migrate_page(mapping, newpage, page, sync);
 
 	head = page_buffers(page);
 
-	rc = migrate_page_move_mapping(mapping, newpage, page);
+	rc = migrate_page_move_mapping(mapping, newpage, page, head, sync);
 
 	if (rc)
 		return rc;
 
-	bh = head;
-	do {
-		get_bh(bh);
-		lock_buffer(bh);
-		bh = bh->b_this_page;
-
-	} while (bh != head);
+	/* In the async case, migrate_page_move_mapping locked the buffers
+	 * with an IRQ-safe spinlock held. In the sync case, the buffers
+	 * need to be locked now
+	 */
+	if (sync)
+		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
 
 	ClearPagePrivate(page);
 	set_page_private(newpage, page_private(page));
@@ -536,10 +597,13 @@ static int writeout(struct address_space *mapping, struct page *page)
  * Default handling if a filesystem does not provide a migration function.
  */
 static int fallback_migrate_page(struct address_space *mapping,
-	struct page *newpage, struct page *page)
+	struct page *newpage, struct page *page, bool sync)
 {
-	if (PageDirty(page))
+	if (PageDirty(page)) {
+		if (!sync)
+			return -EBUSY;
 		return writeout(mapping, page);
+	}
 
 	/*
 	 * Buffers may be managed in a filesystem specific way.
@@ -549,7 +613,7 @@ static int fallback_migrate_page(struct address_space *mapping,
 	    !try_to_release_page(page, GFP_KERNEL))
 		return -EAGAIN;
 
-	return migrate_page(mapping, newpage, page);
+	return migrate_page(mapping, newpage, page, sync);
 }
 
 /*
@@ -585,29 +649,18 @@ static int move_to_new_page(struct page *newpage, struct page *page,
 
 	mapping = page_mapping(page);
 	if (!mapping)
-		rc = migrate_page(mapping, newpage, page);
-	else {
+		rc = migrate_page(mapping, newpage, page, sync);
+	else if (mapping->a_ops->migratepage)
 		/*
-		 * Do not writeback pages if !sync and migratepage is
-		 * not pointing to migrate_page() which is nonblocking
-		 * (swapcache/tmpfs uses migratepage = migrate_page).
+		 * Most pages have a mapping and most filesystems provide a
+		 * migratepage callback. Anonymous pages are part of swap
+		 * space which also has its own migratepage callback. This
+		 * is the most common path for page migration.
 		 */
-		if (PageDirty(page) && !sync &&
-		    mapping->a_ops->migratepage != migrate_page)
-			rc = -EBUSY;
-		else if (mapping->a_ops->migratepage)
-			/*
-			 * Most pages have a mapping and most filesystems
-			 * should provide a migration function. Anonymous
-			 * pages are part of swap space which also has its
-			 * own migration function. This is the most common
-			 * path for page migration.
-			 */
-			rc = mapping->a_ops->migratepage(mapping,
-							newpage, page);
-		else
-			rc = fallback_migrate_page(mapping, newpage, page);
-	}
+		rc = mapping->a_ops->migratepage(mapping,
+						newpage, page, sync);
+	else
+		rc = fallback_migrate_page(mapping, newpage, page, sync);
 
 	if (rc) {
 		newpage->mapping = NULL;



^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-19  8:59   ` Nai Xia
  2011-11-19  9:48     ` Nai Xia
@ 2011-11-21 11:19     ` Mel Gorman
  1 sibling, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 11:19 UTC (permalink / raw)
  To: Nai Xia
  Cc: Linux-MM, Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On Sat, Nov 19, 2011 at 04:59:10PM +0800, Nai Xia wrote:
> > <SNIP>
> > @@ -453,13 +494,18 @@ int buffer_migrate_page(struct address_space *mapping,
> >        if (rc)
> >                return rc;
> >
> > -       bh = head;
> > -       do {
> > -               get_bh(bh);
> > -               lock_buffer(bh);
> > -               bh = bh->b_this_page;
> > -
> > -       } while (bh != head);
> > +       if (!buffer_migrate_lock_buffers(head, sync)) {
> > +               /*
> > +                * We have to revert the radix tree update. If this returns
> > +                * non-zero, it either means that the page count changed
> > +                * which "can't happen" or the slot changed from underneath
> > +                * us in which case someone operated on a page that did not
> > +                * have buffers fully migrated which is alarming so warn
> > +                * that it happened.
> > +                */
> > +               WARN_ON(migrate_page_move_mapping(mapping, page, newpage));
> > +               return -EBUSY;
> 
> If this migrate_page_move_mapping() really fails, seems disk IO will be needed
> to bring the previously already cached page back,

Aside from that, I couldn't see a way of handling the case where the
page had an elevated count due to a speculative lookup.

> I wonder if we should make the
> double check for the two conditions of "page refs is ok " and "all bh
> trylocked"
> before doing radix_tree_replace_slot() ? which I think does not
> involve IO on the
> error path.
> 

I reached the same conclusion when figuring out how to back out of
the elevated page count case. In an updated patch,
migrate_page_move_mapping() returns with buffers locked in the async
case.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan
  2011-11-19 19:54 ` [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan Andrea Arcangeli
@ 2011-11-21 11:51   ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 11:51 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
	linux-kernel

On Sat, Nov 19, 2011 at 08:54:15PM +0100, Andrea Arcangeli wrote:
> Properly take into account if we isolated a compound page during the
> lumpy scan in reclaim and break the loop if we've isolated enough.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
>  1 files changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a1893c0..3421746 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1183,13 +1183,16 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>  				break;
>  
>  			if (__isolate_lru_page(cursor_page, mode, file) == 0) {
> +				unsigned int isolated_pages;
>  				list_move(&cursor_page->lru, dst);
>  				mem_cgroup_del_lru(cursor_page);
> -				nr_taken += hpage_nr_pages(page);

nr_taken was already being updated correctly.

> -				nr_lumpy_taken++;
> +				isolated_pages = hpage_nr_pages(page);
> +				nr_taken += isolated_pages;
> +				nr_lumpy_taken += isolated_pages;

nr_lumpy_taken was not, and this patch corrects it.

>  				if (PageDirty(cursor_page))
> -					nr_lumpy_dirty++;
> +					nr_lumpy_dirty += isolated_pages;

nr_lumpy_dirty was not, and this patch corrects it.

However, the nr_lumpy_* variables here are not of critical importance
as they only are used by a trace point. Fixing them is nice but
functionally changes nothing.

>  				scan++;
> +				pfn += isolated_pages-1;

This is more important. With the current code the next page encountered
after a THP page is isolated will be a tail page, not on the LRU
and will cause the loop to break as __isolate_lru_page will return
EINVAL. With this change, the next PFN encountered will really be
the next PFN of interest.

That said, the impact of this change is low. For THP allocations,
pfn += isolated_pages-1 will bring pfn past end_pfn so with or without
the patch, we break the loop after isolating a THP. For lower-order
allocations, pfn += isolated_pages-1 will also bring pfn past end_pfn.
This patch does help the case where the allocation is for a page larger
than a THP but that is very rare.

Hence, while I think the patch is correct, the changelog is misleading
as it does not have a large impact on breaking the loop if we've
isolated enough. To really break if we've isolated enough, you'd also
need 

scan += isolated_pages-1
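
To be explicit, with that extra update the accounting in the hunk above
would look something like the following (sketch only, not a tested
patch):

	isolated_pages = hpage_nr_pages(page);
	nr_taken += isolated_pages;
	nr_lumpy_taken += isolated_pages;
	if (PageDirty(cursor_page))
		nr_lumpy_dirty += isolated_pages;
	scan += isolated_pages;		/* scan++ plus the isolated_pages-1 above */
	pfn += isolated_pages - 1;	/* skip the tail pages just isolated */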

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/8] mm: compaction: defer compaction only with sync_migration
  2011-11-19 19:54 ` [PATCH 4/8] mm: compaction: defer compaction only with sync_migration Andrea Arcangeli
@ 2011-11-21 12:36   ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 12:36 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
	linux-kernel

On Sat, Nov 19, 2011 at 08:54:16PM +0100, Andrea Arcangeli wrote:
> Let only sync migration drive the
> compaction_deferred()/defer_compaction() logic. So sync migration
> isn't prevented from running if async migration fails. Without sync
> migration, pages requiring migrate.c:writeout() or a ->migratepage
> operation (that isn't migrate_page) can't be migrated, and that has
> the effect of polluting the movable pageblock with pages that won't be
> migrated by async migration, so it's fundamental to guarantee sync
> compaction will be run too before failing.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  mm/page_alloc.c |   50 ++++++++++++++++++++++++++++++--------------------
>  1 files changed, 30 insertions(+), 20 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9dd443d..2229f7d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1891,7 +1891,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  {
>  	struct page *page;
>  
> -	if (!order || compaction_deferred(preferred_zone))
> +	if (!order)
>  		return NULL;
>  

What is the motivation for moving the compation_deferred()
check to __alloc_pages_slowpath()? If compaction was deferred
for async compaction, we try direct reclaim as the linear isolation
might succeed where compaction failed and compaction will likely be
skipped again the second time around.

If anything, entering direct reclaim for THP when compaction is deferred
is wrong as it also potentially stalls in reclaim for a long period
of time.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware"
  2011-11-19 19:54 ` [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware" Andrea Arcangeli
@ 2011-11-21 12:57   ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 12:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner,
	linux-kernel

On Sat, Nov 19, 2011 at 08:54:18PM +0100, Andrea Arcangeli wrote:
> This reverts commit
> 39deaf8585152f1a35c1676d3d7dc6ae0fb65967.
> 
> PageDirty is non blocking for compaction (unlike for
> mm/vmscan.c:may_writepage) so async compaction should include it.
> 

It blocks if fallback_migrate_page() is used which happens if the
underlying filesystem does not support ->migratepage.
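
For reference, the fallback currently looks roughly like this (the
comment on the blocking part is mine):

static int fallback_migrate_page(struct address_space *mapping,
	struct page *newpage, struct page *page)
{
	if (PageDirty(page))
		/* ->writepage can block, which async compaction cannot afford */
		return writeout(mapping, page);

	/*
	 * Buffers may be managed in a filesystem specific way.
	 * We must have no buffers or drop them.
	 */
	if (page_has_private(page) &&
	    !try_to_release_page(page, GFP_KERNEL))
		return -EAGAIN;

	return migrate_page(mapping, newpage, page);
}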

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed"
  2011-11-19 19:54 ` [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed" Andrea Arcangeli
@ 2011-11-21 13:09   ` Mel Gorman
  2011-11-21 15:37     ` Rik van Riel
  0 siblings, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2011-11-21 13:09 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Minchan Kim, Jan Kara, Andy Isaacson, Rik van Riel,
	Johannes Weiner, linux-kernel

On Sat, Nov 19, 2011 at 08:54:19PM +0100, Andrea Arcangeli wrote:
> This reverts commit e0c23279c9f800c403f37511484d9014ac83adec.
> 
> If reclaim runs with a high-order allocation, it means compaction
> failed. That means something went wrong with compaction so we can't
> stop reclaim either. We can't assume it failed and was deferred only
> because of the too-low watermarks in compaction_suitable; it may have
> failed for other reasons.
> 

When Rik was testing with THP enabled, he found that there was way
too much memory free on his machine. The problem was that THP caused
reclaim to be too aggressive and that's what led to that pair of
patches. While I do not think it was confirmed, the expectation was
that the performance of workloads whose working set size was close
to total physical RAM and mostly filesystem-backed files would suffer
if THP was enabled.

In other words, reverting these patches needs to be a last resort.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed"
  2011-11-21 13:09   ` Mel Gorman
@ 2011-11-21 15:37     ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-21 15:37 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, linux-mm, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

On 11/21/2011 08:09 AM, Mel Gorman wrote:
> On Sat, Nov 19, 2011 at 08:54:19PM +0100, Andrea Arcangeli wrote:
>> This reverts commit e0c23279c9f800c403f37511484d9014ac83adec.
>>
>> If reclaim runs with a high-order allocation, it means compaction
>> failed. That means something went wrong with compaction so we can't
>> stop reclaim either. We can't assume it failed and was deferred only
>> because of the too-low watermarks in compaction_suitable; it may have
>> failed for other reasons.
>>
>
> When Rik was testing with THP enabled, he found that there was way
> too much memory free on his machine.

Agreed, without these patches, I saw up to about 4GB
of my 12GB memory being freed by pageout activity,
despite the programs in my system only taking about
10GB anonymous memory.

Needless to say, this completely killed system
performance by constantly pushing everything into
swap and keeping 10-30% of memory free.

This revert makes no sense at all.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages
  2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
  2011-11-18 17:28   ` Andrea Arcangeli
@ 2011-11-21 17:16   ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-21 17:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On 11/18/2011 11:58 AM, Mel Gorman wrote:
> Commit [39deaf85: mm: compaction: make isolate_lru_page() filter-aware]
> noted that compaction does not migrate dirty or writeback pages and
> that is was meaningless to pick the page and re-add it to the LRU list.
>
> What was missed during review is that asynchronous migration moves
> dirty pages if their ->migratepage callback is migrate_page() because
> these can be moved without blocking. This potentially impacted
> hugepage allocation success rates by a factor depending on how many
> dirty pages are in the system.
>
> This patch partially reverts 39deaf85 to allow migration to isolate
> dirty pages again. This increases how much compaction disrupts the
> LRU but that is addressed later in the series.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory
  2011-11-18 16:58 ` [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
  2011-11-18 17:27   ` Andrea Arcangeli
@ 2011-11-21 21:46   ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-21 21:46 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Andrea Arcangeli, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On 11/18/2011 11:58 AM, Mel Gorman wrote:
> When asynchronous compaction was introduced, the
> /proc/sys/vm/compact_memory handler should have been updated to always
> use synchronous compaction. This did not happen so this patch addresses
> it. The assumption is if a user writes to /proc/sys/vm/compact_memory,
> they are willing for that process to stall.
>
> Signed-off-by: Mel Gorman<mgorman@suse.de>

Reviewed-by: Rik van Riel <riel@redhat.com>


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations"
  2011-11-19 19:54 ` [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations" Andrea Arcangeli
@ 2011-11-21 21:57   ` Rik van Riel
  0 siblings, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-21 21:57 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

On 11/19/2011 02:54 PM, Andrea Arcangeli wrote:
> This reverts commit e0887c19b2daa140f20ca8104bdc5740f39dbb86.
>
> If reclaim runs with a high-order allocation, it means compaction
> failed. That means something went wrong with compaction so we can't
> stop reclaim either. We can't assume it failed and was deferred only
> because of the too-low watermarks in compaction_suitable; it may have
> failed for other reasons.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>

NACK

Reverting this can lead to the situation where every time
we have an attempted THP allocation, we free 4MB more
memory.

This has led to systems with 1/4 to 1/3 of all memory free
and pushed to swap, while the system continues with swapout
activity.

The thrashing this causes can be a factor of 10 or worse
performance penalty.  Failing a THP allocation is merely
a 10-20% performance penalty, which is not as much of an
issue.

We can move the threshold at which we skip pageout to be a
little higher (to give compaction more space to work with),
and even call shrink_slab when we skip other reclaiming
(because slab cannot be moved by compaction), but whatever
we do we do need to ensure that we never reclaim an unreasonable
amount of memory and end up pushing the working set into swap.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode
  2011-11-19 19:54 ` [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode Andrea Arcangeli
@ 2011-11-21 21:59   ` Rik van Riel
  2011-11-22  9:51     ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Rik van Riel @ 2011-11-21 21:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: linux-mm, Mel Gorman, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

On 11/19/2011 02:54 PM, Andrea Arcangeli wrote:
> Add a lightweight sync migration (sync == 2) mode that avoids overwork
> so more suitable to be used by compaction to provide lower latency but

> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -552,7 +552,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
>   		nr_migrate = cc->nr_migratepages;
>   		err = migrate_pages(&cc->migratepages, compaction_alloc,
>   				(unsigned long)cc, false,
> -				cc->sync);
> +				cc->sync ? 2 : 0);

Great idea, but it would be good if these numbers got
symbolic names so people trying to learn the code can
figure it out a little more easily.
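
Something along these lines, for instance (names made up here just to
illustrate, whatever we settle on is fine):

enum migrate_mode {
	MIGRATE_ASYNC,		/* never block */
	MIGRATE_SYNC_LIGHT,	/* may block, but skip the expensive work */
	MIGRATE_SYNC,		/* full synchronous migration */
};

so the call above becomes:

	err = migrate_pages(&cc->migratepages, compaction_alloc,
			(unsigned long)cc, false,
			cc->sync ? MIGRATE_SYNC_LIGHT : MIGRATE_ASYNC);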


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-21 11:17     ` Mel Gorman
@ 2011-11-21 22:45       ` Andrea Arcangeli
  2011-11-22  0:55         ` [PATCH] mm: compaction: make buffer cache __GFP_MOVABLE Rik van Riel
  2011-11-22 12:59         ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
  0 siblings, 2 replies; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-21 22:45 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
> On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> > folks who wants low latency or no memory overhead should simply
> > disable compaction.
> 
> That strikes me as being somewhat heavy handed. Compaction should be as
> low latency as possible.

Yes I was meaning in the very short term. Optimizations are always
possible :) we've just to sort out some issues (as previous part of
the email discussed).

> There might be some confusion on what commits were for. Commit
> [e0887c19: vmscan: limit direct reclaim for higher order allocations]
> was not about low latency but more about reclaim/compaction reclaiming
> too much memory. IIRC, Rik's main problem was that there was too much
> memory free on his machine when THP was enabled.
> 
> > the __GFP_NO_KSWAPD check too should be dropped I think,
> 
> Only if we can get rid of the major stalls. I haven't looked closely at
> your series yet but I'll be searching for a replacment for patch 3 of
> this series in it.

I reduced the migrate loops, for both async and sync compactions. I
doubt it'll be very effective but it may help a bit.

This one too I suggest in the short term.

I mean until async migrate can deal with all types of pages (the issues
you're trying to fix) the __GFP_NO_KSWAPD check would not be reliable
enough as part of the movable zone wouldn't be movable. It'd defeat
the reliability from the movable pageblock in compaction context. And
I doubt a more advanced async compaction will be ok for 3.2, so I
don't think 3.2 should have the __GFP_NO_KSWAPD and I tend to back
Andrew's argument. My patch OTOH that only reduces the loops and
doesn't alter the movable pageblock semantics in compaction context,
sounds safer. It won't help equally well though.

> Ok. It's not even close to what I was testing but I can move to this
> test so we're looking at the same thing for allocation success rates.

Note I guess we also need the below. This also should fix by practical
means Rik's trouble (he was using KVM without O_DIRECT on raw
blkdev). That explains why he experienced too much reclaim, the VM had
no choice but to do reclaim because the blkdev cache was not staying
in the movable pageblocks preventing compaction effectiveness (and
likely they used lots of ram).

We may still have to limit reclaim but not like the patch that went
upstream implements. When compaction_suitable is happy about the
wmarks, the compaction loop isn't still as reliable as it could be in
the movable zone, and doing more reclaim helps a lot, the more free
pages the more compaction goes ahead and has a chance to compact more
stuff before the two scans meet in the middle. With Rik we thought
about a way to do just a no-swapping reclaim to shrink caches like
slab.

I mean the cp /dev/sda /dev/null scenario that happens without the
below patch can still materialize in a fileserver with very large slab
caches. We still have to invoke reclaim, maybe not too heavy, we
shouldn't end up swapping, basically it should be a light reclaim (the
opposite of lumpy reclaim). But with the current check upstream
reclaim practically won't run at all if compaction fails, so it's
normal the success rate goes down a lot. (btw I didn't verify yet at
runtime if migration succeeds on the blkdev pagecache)

====
mm: block_dev pagecache is movable

Allow block_dev pagecache to go in movable pageblocks. If compaction
fails reclaim is invoked. So with raw devices being used without
O_DIRECT, reclaim would run too much and because the cache pages would
get mixed with slab and other not movable entities, the compaction
reliability would decrease.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
---
 fs/block_dev.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/fs/block_dev.c b/fs/block_dev.c
index b07f1da..f7111c4 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -565,7 +565,7 @@ struct block_device *bdget(dev_t dev)
 		inode->i_rdev = dev;
 		inode->i_bdev = bdev;
 		inode->i_data.a_ops = &def_blk_aops;
-		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
+		mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);
 		inode->i_data.backing_dev_info = &default_backing_dev_info;
 		spin_lock(&bdev_lock);
 		list_add(&bdev->bd_list, &all_bdevs);
===


> +		if (!trylock_buffer(bh)) {
> +			/*
> +			 * We failed to lock the buffer and cannot stall in
> +			 * async migration. Release the taken locks
> +			 */
> +			struct buffer_head *failed_bh = bh;
> +			put_bh(failed_bh);
> +			bh = head;
> +			while (bh != failed_bh) {
> +				unlock_buffer(bh);
> +				put_bh(bh);
> +				bh = bh->b_this_page;
> +			}
> +			return false;

Ok!

> +	/* In the async case, migrate_page_move_mapping locked the buffers
> +	 * with an IRQ-safe spinlock held. In the sync case, the buffers
> +	 * need to be locked now
> +	 */
> +	if (sync)
> +		BUG_ON(!buffer_migrate_lock_buffers(head, sync));

I seem to recall Andrew said we're ok with this now, but I generally
still prefer stuff that shouldn't be optimized away to be outside of
the BUG checks. No big deal.
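
i.e. something like this, so the locking isn't buried inside the check
(just illustrating the preference, nothing more):

	bool locked;

	locked = buffer_migrate_lock_buffers(head, sync);
	BUG_ON(!locked);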

Patch looks better now thanks, I'll try to do a closer review tomorrow
and test it.

Thanks!

^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH] mm: compaction: make buffer cache __GFP_MOVABLE
  2011-11-21 22:45       ` Andrea Arcangeli
@ 2011-11-22  0:55         ` Rik van Riel
  2011-11-22 12:59         ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-22  0:55 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Mel Gorman, Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML, linux-fsdevel

On 11/21/2011 05:45 PM, Andrea Arcangeli wrote:
> On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
>> On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
>>> folks who wants low latency or no memory overhead should simply
>>> disable compaction.
>>
>> That strikes me as being somewhat heavy handed. Compaction should be as
>> low latency as possible.
>
> Yes I was meaning in the very short term. Optimizations are always
> possible :) we've just to sort out some issues (as previous part of
> the email discussed).
>
>> There might be some confusion on what commits were for. Commit
>> [e0887c19: vmscan: limit direct reclaim for higher order allocations]
>> was not about low latency but more about reclaim/compaction reclaiming
>> too much memory. IIRC, Rik's main problem was that there was too much
>> memory free on his machine when THP was enabled.
>>
>>> the __GFP_NO_KSWAPD check too should be dropped I think,
>>
>> Only if we can get rid of the major stalls. I haven't looked closely at
>> your series yet but I'll be searching for a replacment for patch 3 of
>> this series in it.
>
> I reduced the migrate loops, for both async and sync compactions. I
> doubt it'll be very effective but it may help a bit.
>
> This one too I suggest in the short term.
>
> I mean until async migrate can deal with all types of pages (the issues
> you're trying to fix) the __GFP_NO_KSWAPD check would not be reliable
> enough as part of the movable zone wouldn't be movable. It'd defeat
> the reliability from the movable pageblock in compaction context. And
> I doubt a more advanced async compaction will be ok for 3.2, so I
> don't think 3.2 should have the __GFP_NO_KSWAPD and I tend to back
> Andrew's argument. My patch OTOH that only reduces the loops and
> doesn't alter the movable pageblock semantics in compaction context,
> sounds safer. It won't help equally well though.
>
>> Ok. It's not even close to what I was testing but I can move to this
>> test so we're looking at the same thing for allocation success rates.
>
> Note I guess we also need the below. This also should fix by practical
> means Rik's trouble (he was using KVM without O_DIRECT on raw
> blkdev). That explains why he experienced too much reclaim, the VM had
> no choice but to do reclaim because the blkdev cache was not staying
> in the movable pageblocks preventing compaction effectiveness (and
> likely they used lots of ram).
>
> We may still have to limit reclaim but not like the patch that went
> upstream implements. When compaction_suitable is happy about the
> wmarks, the compaction loop isn't still as reliable as it could be in
> the movable zone, and doing more reclaim helps a lot, the more free
> pages the more compaction goes ahead and has a chance to compact more
> stuff before the two scans meet in the middle. With Rik we thought
> about a way to do just a no-swapping reclaim to shrink caches like
> slab.
>
> I mean the cp /dev/sda /dev/null scenario that happens without the
> below patch can still materialize in a fileserver with very large slab
> caches. We still have to invoke reclaim, maybe not too heavy, we
> shouldn't end up swapping, basically it should be a light reclaim (the
> opposite of lumpy reclaim). But with the current check upstream
> reclaim practically won't run at all if compaction fails, so it's
> normal the success rate goes down a lot. (btw I didn't verify yet at
> runtime if migration succeeds on the blkdev pagecache)
>
> ====
> mm: block_dev pagecache is movable
>
> Allow block_dev pagecache to go in movable pageblocks. If compaction
> fails reclaim is invoked. So with raw devices being used without
> O_DIRECT, reclaim would run too much and because the cache pages would
> get mixed with slab and other not movable entities, the compaction
> reliability would decrease.
>
> Signed-off-by: Andrea Arcangeli<aarcange@redhat.com>
> ---
>   fs/block_dev.c |    2 +-
>   1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index b07f1da..f7111c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -565,7 +565,7 @@ struct block_device *bdget(dev_t dev)
>   		inode->i_rdev = dev;
>   		inode->i_bdev = bdev;
>   		inode->i_data.a_ops =&def_blk_aops;
> -		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> +		mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);
>   		inode->i_data.backing_dev_info =&default_backing_dev_info;
>   		spin_lock(&bdev_lock);
>   		list_add(&bdev->bd_list,&all_bdevs);

Reviewed-by: Rik van Riel <riel@redhat.com>

Maybe this would be prettier done by defining a GFP_USER_MOVABLE in the
proper include file and using that?
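
e.g. something like the below; the define doesn't exist today, the name
is just the obvious suggestion:

/* include/linux/gfp.h */
#define GFP_USER_MOVABLE	(GFP_USER | __GFP_MOVABLE)

/* fs/block_dev.c:bdget() */
mapping_set_gfp_mask(&inode->i_data, GFP_USER_MOVABLE);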

Still, we need this fix.

I believe this is safe because all the users of buffer cache
pages already need to protect the page against getting
reclaimed (by taking a refcount), which also protects against
migration.

Only pages without an extra refcount (not in current use by
anyone) can be migrated.
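
(From memory, migrate_page_move_mapping() enforces that with a check
roughly like:

	expected_count = 2 + page_has_private(page);
	if (page_count(page) != expected_count ||
	    radix_tree_deref_slot_protected(pslot, &mapping->tree_lock) != page) {
		spin_unlock_irq(&mapping->tree_lock);
		return -EAGAIN;
	}

so any unexpected reference just makes migration back off with
-EAGAIN.)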

I do not believe this change could break any filesystem, but
adding linux-fsdevel to the CC anyway just to check...

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode
  2011-11-21 21:59   ` Rik van Riel
@ 2011-11-22  9:51     ` Mel Gorman
  0 siblings, 0 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-22  9:51 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Andrea Arcangeli, linux-mm, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, linux-kernel

On Mon, Nov 21, 2011 at 04:59:36PM -0500, Rik van Riel wrote:
> On 11/19/2011 02:54 PM, Andrea Arcangeli wrote:
> >Add a lightweight sync migration (sync == 2) mode that avoids overwork
> >so more suitable to be used by compaction to provide lower latency but
> 
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -552,7 +552,7 @@ static int compact_zone(struct zone *zone, struct compact_control *cc)
> >  		nr_migrate = cc->nr_migratepages;
> >  		err = migrate_pages(&cc->migratepages, compaction_alloc,
> >  				(unsigned long)cc, false,
> >-				cc->sync);
> >+				cc->sync ? 2 : 0);
> 
> Great idea, but it would be good if these numbers got
> symbolic names so people trying to learn the code can
> figure it out a little more easily.
> 

I took the bulk of this patch and gave them symbolic names when trying
to reconcile the two series. I didn't take all of this patch, such as
varying the number of passes, because even if that turns out to be of
benefit, it should be a separate patch.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-21 22:45       ` Andrea Arcangeli
  2011-11-22  0:55         ` [PATCH] mm: compaction: make buffer cache __GFP_MOVABLE Rik van Riel
@ 2011-11-22 12:59         ` Mel Gorman
  2011-11-24  1:19           ` Andrea Arcangeli
  1 sibling, 1 reply; 40+ messages in thread
From: Mel Gorman @ 2011-11-22 12:59 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Mon, Nov 21, 2011 at 11:45:45PM +0100, Andrea Arcangeli wrote:
> On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
> > On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> > > folks who wants low latency or no memory overhead should simply
> > > disable compaction.
> > 
> > That strikes me as being somewhat heavy handed. Compaction should be as
> > low latency as possible.
> 
> Yes I was meaning in the very short term. Optimizations are always
> possible :) we've just to sort out some issues (as previous part of
> the email discussed).
> 

To put some figures on the latency impacts, I added a test to MM
Tests similar to yours that is in three parts

1. A process reads /dev/sda and writes to /dev/null
2. A process reads from /dev/zero and writes to a filesystem on a USB stick
3. When memory is full, a process starts that creates an anonymous
   mapping

http://www.csn.ul.ie/~mel/postings/compaction-20111122/writebackCPvfat/hydra/comparison.html
http://www.csn.ul.ie/~mel/postings/compaction-20111122/writebackCPext4/hydra/comparison.html

Looking at the vfat figures, we see

THPAVAIL
            thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
                 3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
System Time         1.89 (    0.00%)    1.90 (   -0.21%)    3.83 (  -50.63%)   19.95 (  -90.52%)   71.91 (  -97.37%)
+/-                 0.15 (    0.00%)    0.05 (  205.45%)    2.39 (  -93.87%)   27.59 (  -99.47%)    0.75 (  -80.49%)
User Time           0.10 (    0.00%)    0.10 (   -4.00%)    0.10 (   -7.69%)    0.12 (  -17.24%)    0.11 (   -9.43%)
+/-                 0.02 (    0.00%)    0.02 (  -24.28%)    0.03 (  -38.31%)    0.02 (    0.00%)    0.04 (  -56.57%)
Elapsed Time      986.53 (    0.00%) 1189.77 (  -17.08%)  589.48 (   67.36%)  506.45 (   94.79%)   73.39 ( 1244.27%)
+/-                35.52 (    0.00%)   90.09 (  -60.57%)   49.00 (  -27.51%)  213.56 (  -83.37%)    0.47 ( 7420.98%)
THP Active        118.80 (    0.00%)   89.20 (  -33.18%)   35.40 ( -235.59%)    8.00 (-1385.00%)   44.00 ( -170.00%)
+/-                65.57 (    0.00%)   35.37 (  -85.38%)   29.23 ( -124.33%)   12.51 ( -424.28%)   19.83 ( -230.65%)
Fault Alloc       308.80 (    0.00%)  244.20 (  -26.45%)   59.40 ( -419.87%)   81.00 ( -281.23%)   95.80 ( -222.34%)
+/-                81.62 (    0.00%)   80.53 (   -1.36%)   26.85 ( -203.98%)   67.85 (  -20.30%)   34.29 ( -138.05%)
Fault Fallback    697.20 (    0.00%)  761.60 (   -8.46%)  946.40 (  -26.33%)  924.80 (  -24.61%)  909.80 (  -23.37%)
+/-                81.62 (    0.00%)   80.52 (    1.37%)   26.65 (  206.28%)   67.93 (   20.16%)   34.25 (  138.33%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)        541.58    645.79   1220.88   1131.04    976.54
Total Elapsed Time (seconds)               5066.05   6083.58   3068.08   2653.11    493.96

Straight off, 3.0, 3.1 and 3.2-rc2 have a number of THPs in use. 3.0
had a roughly 30% success rate with 10% of the mapping using THP but
look at the cost in time. It took about 16 minutes per iteration (5
iterations which is why total elapsed time looks large) to fault in a
mapping the size of physical memory versus 1.5 minutes with my series
(which is still very slow).

In my series system time is stupidly high but I haven't run oprofile
to see exactly where. It could be because it's running compaction
because defer_compaction() is not being called at the right times
but it could also be because direct reclaim scanning is through the
roof albeit better than 3.2-rc2 vanilla. I suspect direct reclaim
scanning could be because we are not calling ->writepage in direct
reclaim any more - 3.0 wrote 178215 pages while my series wrote 5.

The story looks better for ext4 as it is not writing pages in
fallback_migrate_page.

THPAVAIL
            thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
                 3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
System Time         2.64 (    0.00%)    2.49 (    6.10%)    3.09 (  -14.38%)    4.29 (  -38.40%)   12.20 (  -78.33%)
+/-                 0.29 (    0.00%)    0.61 (  -51.80%)    1.17 (  -74.90%)    1.33 (  -78.04%)   16.27 (  -98.20%)
User Time           0.12 (    0.00%)    0.08 (   51.28%)    0.10 (   13.46%)    0.10 (   13.46%)    0.09 (   34.09%)
+/-                 0.03 (    0.00%)    0.02 (    6.30%)    0.01 (   94.49%)    0.01 (  158.69%)    0.03 (    0.00%)
Elapsed Time      107.76 (    0.00%)   64.42 (   67.29%)   60.89 (   76.99%)   30.90 (  248.69%)   50.72 (  112.45%)
+/-                39.83 (    0.00%)   17.03 (  133.90%)    8.75 (  355.08%)   17.20 (  131.65%)   22.87 (   74.14%)
THP Active         82.80 (    0.00%)   84.20 (    1.66%)   35.00 ( -136.57%)   53.00 (  -56.23%)   36.40 ( -127.47%)
+/-                81.42 (    0.00%)   80.82 (   -0.74%)   20.89 ( -289.73%)   34.50 ( -136.01%)   51.51 (  -58.07%)
Fault Alloc       246.00 (    0.00%)  292.60 (   15.93%)   66.20 ( -271.60%)  129.40 (  -90.11%)   90.20 ( -172.73%)
+/-               173.64 (    0.00%)  161.13 (   -7.76%)   21.30 ( -715.14%)   52.41 ( -231.32%)  104.14 (  -66.74%)
Fault Fallback    759.60 (    0.00%)  712.40 (    6.63%)  939.80 (  -19.17%)  876.20 (  -13.31%)  915.60 (  -17.04%)
+/-               173.49 (    0.00%)  161.13 (    7.67%)   21.30 (  714.44%)   52.16 (  232.59%)  104.00 (   66.82%)
MMTests Statistics: duration
User/Sys Time Running Test (seconds)         63.41     50.64     39.79      58.8       171
Total Elapsed Time (seconds)                732.79    483.23    447.17    335.28    412.21

THP availability is very variable between iterations and the stall
times are nowhere near as bad as with vfat but still, my series cuts
the time to fault in a mapping by half (and that is still very slow).

> > There might be some confusion on what commits were for. Commit
> > [e0887c19: vmscan: limit direct reclaim for higher order allocations]
> > was not about low latency but more about reclaim/compaction reclaiming
> > too much memory. IIRC, Rik's main problem was that there was too much
> > memory free on his machine when THP was enabled.
> > 
> > > the __GFP_NO_KSWAPD check too should be dropped I think,
> > 
> > Only if we can get rid of the major stalls. I haven't looked closely at
> > your series yet but I'll be searching for a replacment for patch 3 of
> > this series in it.
> 
> I reduced the migrate loops, for both async and sync compactions. I
> doubt it'll be very effective but it may help a bit.
> 

It may help the system times but lacking a profile, I think it's more
likely it's being spent in direct reclaim skipping over dirty pages.

> This one too I suggest in the short term.
> 
> I mean until async migrate can deal with all types of pages (the issues
> you're trying to fix) the __GFP_NO_KSWAPD check would not be reliable
> enough as part of the movable zone wouldn't be movable. It'd defeat
> the reliability from the movable pageblock in compaction context. And
> I doubt a more advanced async compaction will be ok for 3.2, so I
> don't think 3.2 should have the __GFP_NO_KSWAPD and I tend to back
> Andrew's argument. My patch OTOH that only reduces the loops and
> doesn't alter the movable pageblock semantics in compaction context,
> sounds safer. It won't help equally well though.
> 

I don't think async compaction can or will deal with all types of
pages, but more of them can be dealt with.  In a revised series,
I kept sync compaction and took parts of your work to reduce the
latency (specifically sync-light, but using symbolic names).

If you want, the __GFP_NO_KSWAPD check patch can be dropped as that
will also keep David happy until this series is fully baked if we
agree in principle that the current stalls are unacceptably large. It
will be a stretch to complete it in time for 3.2 though meaning that
stalls in 3.2 will be severe with THP enabled.

> > Ok. It's not even close to what I was testing but I can move to this
> > test so we're looking at the same thing for allocation success rates.
> 
> Note I guess we also need the below. This also should fix by practical
> means Rik's trouble (he was using KVM without O_DIRECT on raw
> blkdev). That explains why he experienced too much reclaim, the VM had
> no choice but to do reclaim because the blkdev cache was not staying
> in the movable pageblocks preventing compaction effectiveness (and
> likely they used lots of ram).
> 

Will comment on this patch below.

> We may still have to limit reclaim but not like the patch that went
> upstream implements.

That should be possible but again it will need to be balanced with
stall times.

> When compaction_suitable is happy about the
> wmarks, the compaction loop isn't still as reliable as it could be in
> the movable zone, and doing more reclaim helps a lot, the more free
> pages the more compaction goes ahead and has a chance to compact more
> stuff before the two scans meet in the middle. With Rik we thought
> about a way to do just a no-swapping reclaim to shrink caches like
> slab.
> 

Ok, is there a patch? I'll investigate this later when I get the time
but in the shorter term, I'll be looking at why the direct reclaim
scan figures are so high.

> I mean the cp /dev/sda /dev/null scenario that happens without the
> below patch can still materialize in a fileserver with very large slab
> caches. We still have to invoke reclaim, maybe not too heavy, we
> shouldn't end up swapping, basically it should be a light reclaim (the
> opposite of lumpy reclaim). But with the current check upstream
> reclaim practically won't run at all if compaction fails, so it's
> normal the success rate goes down a lot. (btw I didn't verify yet at
> runtime if migration succeeds on the blkdev pagecache)
> 

This last "btw" is critical.

> ====
> mm: block_dev pagecache is movable
> 
> Allow block_dev pagecache to go in movable pageblocks. If compaction
> fails reclaim is invoked. So with raw devices being used without
> O_DIRECT, reclaim would run too much and because the cache pages would
> get mixed with slab and other not movable entities, the compaction
> reliability would decrease.
> 
> Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
> ---
>  fs/block_dev.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index b07f1da..f7111c4 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -565,7 +565,7 @@ struct block_device *bdget(dev_t dev)
>  		inode->i_rdev = dev;
>  		inode->i_bdev = bdev;
>  		inode->i_data.a_ops = &def_blk_aops;
> -		mapping_set_gfp_mask(&inode->i_data, GFP_USER);
> +		mapping_set_gfp_mask(&inode->i_data, GFP_USER|__GFP_MOVABLE);
>  		inode->i_data.backing_dev_info = &default_backing_dev_info;
>  		spin_lock(&bdev_lock);
>  		list_add(&bdev->bd_list, &all_bdevs);

This is not the first time this patch existed. Look at the first part of
the patch in http://comments.gmane.org/gmane.linux.kernel.mm/13850 .
This was back in 2007 and the bio_alloc part was totally bogus but this
part initially looked right.

This is a long time ago so my recollection is hazy, but I found at the
time that the allocations tended to be short-lived so tests would
initially look good even though the pages were not movable.

A few months later I wrote
http://kerneltrap.org/mailarchive/linux-kernel/2007/5/17/92186 where
I commented "bdget() no longer uses mobility flags after this patch
because it does not appear that any pages allocated on behalf of
the mapping are movable so it needs to be revisited separately." I
don't remember if I revisited it.

In retrospect, the main difference may be between raw access to the
device and when the block device backs a filesystem that is currently
mounted. While mounted, the filesystem may pin pages for metadata
access which would prevent them ever being migrated and make them
unsuitable for use with __GFP_MOVABLE.

In Rik's case, this might be less of an issue. It's using the
def_blk_aops as the address space operations that does not have a
->migratepage handler so minimally if they are dirty they have to be
written out meaning it will behave more like vfat than ext4 in terms
of compaction performance. I did not spot anything that would actually
pin the buffers and prevent them from being released though but would
prefer it was double checked.
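
The test that decides it is buffer_busy() in fs/buffer.c, used by
drop_buffers(); from memory it is roughly:

	return atomic_read(&bh->b_count) |
		(bh->b_state & ((1 << BH_Dirty) | (1 << BH_Lock)));

so an elevated b_count, a dirty buffer or a locked buffer is enough to
make try_to_release_page() fail.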

> ===
> 
> 
> > +		if (!trylock_buffer(bh)) {
> > +			/*
> > +			 * We failed to lock the buffer and cannot stall in
> > +			 * async migration. Release the taken locks
> > +			 */
> > +			struct buffer_head *failed_bh = bh;
> > +			put_bh(failed_bh);
> > +			bh = head;
> > +			while (bh != failed_bh) {
> > +				unlock_buffer(bh);
> > +				put_bh(bh);
> > +				bh = bh->b_this_page;
> > +			}
> > +			return false;
> 
> Ok!
> 
> > +	/* In the async case, migrate_page_move_mapping locked the buffers
> > +	 * with an IRQ-safe spinlock held. In the sync case, the buffers
> > +	 * need to be locked now
> > +	 */
> > +	if (sync)
> > +		BUG_ON(!buffer_migrate_lock_buffers(head, sync));
> 
> I seem to recall Andrew said we're ok with this now, but I generally
> still prefer stuff that shouldn't be optimized away to be outside of
> the BUG checks. No big deal.
> 
> Patch looks better now thanks, I'll try to do a closer review tomorrow
> and test it.
> 

If you're going to test a series, can you look at the V4 I posted that
attempts to reconcile our two series?

Thanks.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage
  2011-11-22 12:59         ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migrated without blocking within ->migratepage Mel Gorman
@ 2011-11-24  1:19           ` Andrea Arcangeli
  2011-11-24 12:21             ` Mel Gorman
  0 siblings, 1 reply; 40+ messages in thread
From: Andrea Arcangeli @ 2011-11-24  1:19 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Tue, Nov 22, 2011 at 12:59:06PM +0000, Mel Gorman wrote:
> On Mon, Nov 21, 2011 at 11:45:45PM +0100, Andrea Arcangeli wrote:
> > On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
> > > On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> > > > folks who wants low latency or no memory overhead should simply
> > > > disable compaction.
> > > 
> > > That strikes me as being somewhat heavy handed. Compaction should be as
> > > low latency as possible.
> > 
> > Yes I was meaning in the very short term. Optimizations are always
> > possible :) we've just to sort out some issues (as previous part of
> > the email discussed).
> > 
> 
> To put some figures on the latency impacts, I added a test to MM
> Tests similar to yours that is in three parts
> 
> 1. A process reads /dev/sda and writes to /dev/null

Yes, also note that ironically this is likely to be a better test for
this without the __GFP_MOVABLE in block_dev.c. Even if we want it
fixed, maybe another source that reduces the non-movable pages may be
needed then.

> 2. A process reads from /dev/zero and writes to a filesystem on a USB stick
> 3. When memory is full, a process starts that creates an anonymous
>    mapping

Great.

> Looking at the vfat figures, we see
> 
> THPAVAIL
>             thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
>                  3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
> System Time         1.89 (    0.00%)    1.90 (   -0.21%)    3.83 (  -50.63%)   19.95 (  -90.52%)   71.91 (  -97.37%)
> +/-                 0.15 (    0.00%)    0.05 (  205.45%)    2.39 (  -93.87%)   27.59 (  -99.47%)    0.75 (  -80.49%)
> User Time           0.10 (    0.00%)    0.10 (   -4.00%)    0.10 (   -7.69%)    0.12 (  -17.24%)    0.11 (   -9.43%)
> +/-                 0.02 (    0.00%)    0.02 (  -24.28%)    0.03 (  -38.31%)    0.02 (    0.00%)    0.04 (  -56.57%)
> Elapsed Time      986.53 (    0.00%) 1189.77 (  -17.08%)  589.48 (   67.36%)  506.45 (   94.79%)   73.39 ( 1244.27%)
> +/-                35.52 (    0.00%)   90.09 (  -60.57%)   49.00 (  -27.51%)  213.56 (  -83.37%)    0.47 ( 7420.98%)
> THP Active        118.80 (    0.00%)   89.20 (  -33.18%)   35.40 ( -235.59%)    8.00 (-1385.00%)   44.00 ( -170.00%)
> +/-                65.57 (    0.00%)   35.37 (  -85.38%)   29.23 ( -124.33%)   12.51 ( -424.28%)   19.83 ( -230.65%)
> Fault Alloc       308.80 (    0.00%)  244.20 (  -26.45%)   59.40 ( -419.87%)   81.00 ( -281.23%)   95.80 ( -222.34%)
> +/-                81.62 (    0.00%)   80.53 (   -1.36%)   26.85 ( -203.98%)   67.85 (  -20.30%)   34.29 ( -138.05%)
> Fault Fallback    697.20 (    0.00%)  761.60 (   -8.46%)  946.40 (  -26.33%)  924.80 (  -24.61%)  909.80 (  -23.37%)
> +/-                81.62 (    0.00%)   80.52 (    1.37%)   26.65 (  206.28%)   67.93 (   20.16%)   34.25 (  138.33%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)        541.58    645.79   1220.88   1131.04    976.54
> Total Elapsed Time (seconds)               5066.05   6083.58   3068.08   2653.11    493.96

Notice how the fault alloc goes down from 308-244 to 59 in
rc2-vanilla; that's the vmscan.c patch I backed out in my series... So
that explains why I backed out that bit. Ok, the elapsed time improved
because reclaim is used less, but nothing close to x10 times faster
thanks to it, even for this workload and even on vfat, and if you add
__GFP_MOVABLE to block_dev.c things will likely change for the better
too. (It was x10 times slower if that led to swapping, likely not the
case here, but again the __GFP_MOVABLE should take care of it.)

> Straight off, 3.0, 3.1 and 3.2-rc2 have a number of THPs in use. 3.0
> had a roughly 30% success rate with 10% of the mapping using THP but
> look at the cost in time. It took about 16 minutes per iteration (5
> iterations which is why total elapsed time looks large) to fault in a
> mapping the size of physical memory versus 1.5 minutes with my series
> (which is still very slow).

I'd be curious if you tested my exact status too.

> In my series system time is stupidly high but I haven't run oprofile
> to see exactly where. It could be because it's running compaction
> because defer_compaction() is not being called at the right times
> but it could also be because direct reclaim scanning is through the
> roof albeit better than 3.2-rc2 vanilla. I suspect direct reclaim
> scanning could be because we are not calling ->writepage in direct
> reclaim any more - 3.0 wrote 178215 pages while my series wrote 5.
> 
> The story looks better for ext4 as it is not writing pages in
> fallback_migrate_page.

That certainly has an effect on the USB stick as they're all vfat.

> THPAVAIL
>             thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
>                  3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
> System Time         2.64 (    0.00%)    2.49 (    6.10%)    3.09 (  -14.38%)    4.29 (  -38.40%)   12.20 (  -78.33%)
> +/-                 0.29 (    0.00%)    0.61 (  -51.80%)    1.17 (  -74.90%)    1.33 (  -78.04%)   16.27 (  -98.20%)
> User Time           0.12 (    0.00%)    0.08 (   51.28%)    0.10 (   13.46%)    0.10 (   13.46%)    0.09 (   34.09%)
> +/-                 0.03 (    0.00%)    0.02 (    6.30%)    0.01 (   94.49%)    0.01 (  158.69%)    0.03 (    0.00%)
> Elapsed Time      107.76 (    0.00%)   64.42 (   67.29%)   60.89 (   76.99%)   30.90 (  248.69%)   50.72 (  112.45%)
> +/-                39.83 (    0.00%)   17.03 (  133.90%)    8.75 (  355.08%)   17.20 (  131.65%)   22.87 (   74.14%)
> THP Active         82.80 (    0.00%)   84.20 (    1.66%)   35.00 ( -136.57%)   53.00 (  -56.23%)   36.40 ( -127.47%)
> +/-                81.42 (    0.00%)   80.82 (   -0.74%)   20.89 ( -289.73%)   34.50 ( -136.01%)   51.51 (  -58.07%)
> Fault Alloc       246.00 (    0.00%)  292.60 (   15.93%)   66.20 ( -271.60%)  129.40 (  -90.11%)   90.20 ( -172.73%)
> +/-               173.64 (    0.00%)  161.13 (   -7.76%)   21.30 ( -715.14%)   52.41 ( -231.32%)  104.14 (  -66.74%)
> Fault Fallback    759.60 (    0.00%)  712.40 (    6.63%)  939.80 (  -19.17%)  876.20 (  -13.31%)  915.60 (  -17.04%)
> +/-               173.49 (    0.00%)  161.13 (    7.67%)   21.30 (  714.44%)   52.16 (  232.59%)  104.00 (   66.82%)
> MMTests Statistics: duration
> User/Sys Time Running Test (seconds)         63.41     50.64     39.79      58.8       171
> Total Elapsed Time (seconds)                732.79    483.23    447.17    335.28    412.21

Here again fault alloc goes down from 246-292 to 66 from 3.0-3.1 to
rc2. And elapsed time with ext4 is the same for 3.1.0 and rc2 (thanks
to ext4 instead of vfat). So again no big difference from the
vmscan.c backout.

I think we can keep a limit to how deep reclaim can go (maybe same
percentage logic we use in kswapd for the wmarks), but what was
implemented is too strig and just skipping reclaim when compaction
fails, doesn't sound ok. At least until compaction gets stronger in
being able to compact memory even when few ram is free. Compaction
can't free memory, it takes what is free, and creates "contigous free
pages" from "fragmented free pages". So the more free memory the
higher chance of compaction success. This is why not stopping relclaim
(if compaction fails) is good and increases the success rate
significantly (triple it in the above load) without actually slowing
down the elapsed time at all as shown above.

> THP availability is very variable between iterations and the stall
> times are nowhere near as bad as with vfat but still, my series cuts
> the time to fault in a mapping by half (and that is still very slow).

The results you got in compaction success for the first three columns
are similar to what I get. Of course this is an extreme load with a
flood of allocations and tons of dirty ram, so it's normal that the
success rate of compaction is significantly lower than in a light-VM
condition. But if reclaim is completely stopped by the vmscan patch
then even under light-VM condition the success rate of THP allocation
goes down.

> If you want, the __GFP_NO_KSWAPD check patch can be dropped as that
> will also keep David happy until this series is fully baked if we
> agree in principle that the current stalls are unacceptably large. It
> will be a stretch to complete it in time for 3.2 though meaning that
> stalls in 3.2 will be severe with THP enabled.

Well there's no risk that it will be worse than 3.1 at least. Also
with a more powerful async compaction the chance we end up in sync
compaction diminishes.

I think until we can compact all pages through async compaction, we
should keep sync compaction in. Pages from filesystems that require a
fallback_migrate_page would otherwise become unmovable even though they're in
the middle of a movable block, and that's not OK. We absolutely need a
sync compaction pass to get rid of those. Now you're improving that
part, but until we can handle all types of pages I think a final pass of
sync compaction should still run.
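
For reference the fallback path is roughly like this (simplified and
from memory, not the exact code in your series):

static int fallback_migrate_page(struct address_space *mapping,
		struct page *newpage, struct page *page, bool sync)
{
	if (PageDirty(page)) {
		/* only a sync pass is allowed to write the page out */
		if (!sync)
			return -EBUSY;
		return writeout(mapping, page);
	}

	/* buffers may be managed by the fs; drop them or give up */
	if (page_has_private(page) &&
	    !try_to_release_page(page, GFP_KERNEL))
		return -EAGAIN;

	return migrate_page(mapping, newpage, page, sync);
}

So anything that ends up in this path with dirty pages can only be
migrated by a sync pass.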

Now, making async compaction more reliable means we call reclaim less
and sync compaction less, so it's going to help, and we should still be
better than before without having to decrease the reliability of
compaction.

> > When compaction_suitable is happy about the
> > wmarks, the compaction loop still isn't as reliable as it could be in
> > the movable zone, and doing more reclaim helps a lot; the more free
> > pages, the more compaction goes ahead and has a chance to compact more
> > stuff before the two scans meet in the middle. With Rik we thought
> > about a way to do just a no-swapping reclaim to shrink caches like
> > slab.
> > 
> 
> Ok, is there a patch? I'll investigate this later when I get the time
> but in the shorter term, I'll be looking at why the direct reclaim
> scan figures are so high.

No patch for this one, just my quick backout of the vmscan.c part
and your benchmark above explains why I did it.

> In retrospect, the main difference may be between raw access to the
> device and when the block device backs a filesystem that is currently
> mounted. While mounted, the filesystem may pin pages for metadata
> access which would prevent them ever being migrated and make them
> unsuitable for use with __GFP_MOVABLE.

The fs will look up the bh in the hash and get access to the
pagecache from there (to avoid aliasing between the blkdev pagecache
and the bh, so when you write to /dev/sda on a mounted fs it's immediately
seen by the filesystem and the writes from pagecache won't get
discarded when the fs flushes its dirty buffers). The pin you mention
should be on the bh, like the superblock buffers.

But funny thing, grow_dev_page already sets __GFP_MOVABLE. That's
pretty weird and it's probably a source of a few non-movable pages in
the movable blocks. But then many bh are movable... most of them are;
it's just the superblock that isn't.

But considering grow_dev_page sets __GFP_MOVABLE, pins from the fs on
the block_dev.c pagecache shouldn't be a concern...
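
(For the record, the block_dev.c side of the fix boils down to
something like this when the bdev inode is set up - I haven't
re-checked the exact hunk in Rik's patch, so take it as the direction
rather than the patch itself:

	mapping_set_gfp_mask(&inode->i_data,
			     mapping_gfp_mask(&inode->i_data) | __GFP_MOVABLE);

so the blkdev pagecache gets allocated from MIGRATE_MOVABLE pageblocks
like any other pagecache.)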

> In Rik's case, this might be less of an issue. It's using the
> def_blk_aops as the address space operations that does not have a
> ->migratepage handler so minimally if they are dirty they have to be
> written out meaning it will behave more like vfat than ext4 in terms
> of compaction performance. I did not spot anything that would actually
> pin the buffers and prevent them from being released though but would
> prefer it was double checked.

Double checking is a good idea.

The reason Rik's case can be improved greatly by the __GFP_MOVABLE in
block_dev.c is that when there's very little movable memory, it's
less likely we compact stuff. All the pagecache (it actually shows up as
buffer cache in free, but it's pagecache) goes into the non-movable
pageblocks, so those get polluted by a mixture of slab and other stuff. So the
vast majority of memory becomes non-compactable. Compaction works
better if all RAM is movable, of course. With that missing
__GFP_MOVABLE, async migration would always fail and invoke
reclaim far more than it would have on a real fs. And that
explains his swap storms, I think. I haven't reproduced it but that's my
theory at least, and it should be tested.

> If you're going to test a series, can you look at the V4 I posted that
> attempts to reconcile our two series?

I had to look into other things, sorry. I hope to get around to testing this
too. I actually played with some other change in the compaction scan
which is more of an experiment and not related to the above, but I've nothing
conclusive yet.

Also, what we have in 3.0/3.1 works _very_ well; there is no urgency to
make too many changes which haven't been benchmarked/tested much yet
and that will improve latency but reduce the allocation rate. If the
allocations are very long lived, giving a better chance to allocate
those pages is good even if there's write I/O. vfat + slow storage is
a bit of a pathological case; it doesn't actually apply to all
hardware/fs combinations. __GFP_MOVABLE missing from block_dev also was not
so common, and it most certainly contributed to reclaim being more
aggressive than it would have been with that fix. I think you can
push things one at a time without urgency here, and I'd prefer it if the
block_dev patch is applied and the other reverted in vmscan.c, or
improved to start limiting only if we're above 8*high or some
percentage check, to allow a little more reclaim than rc2 allows
(i.e. no reclaim at all, which likely results in a failure in hugepage
allocation). Not being unlimited like 3.1 is OK with me, but if kswapd can free
a percentage I don't see why reclaim can't (considering more free
pages in movable pageblocks are needed for compaction to succeed). The
ideal is to improve the compaction rate and at the same time reduce
reclaim aggressiveness. Let's start with the parts that are more
obviously right fixes and that don't risk regressions; we don't want
compaction regressions :).

Thanks for taking care of sorting this!

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage
  2011-11-24  1:19           ` Andrea Arcangeli
@ 2011-11-24 12:21             ` Mel Gorman
  2011-11-26  6:51               ` Andy Isaacson
  2011-11-27 20:50               ` Rik van Riel
  0 siblings, 2 replies; 40+ messages in thread
From: Mel Gorman @ 2011-11-24 12:21 UTC (permalink / raw)
  To: Andrea Arcangeli
  Cc: Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson, Johannes Weiner, LKML

On Thu, Nov 24, 2011 at 02:19:43AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 22, 2011 at 12:59:06PM +0000, Mel Gorman wrote:
> > On Mon, Nov 21, 2011 at 11:45:45PM +0100, Andrea Arcangeli wrote:
> > > On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
> > > > On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> > > > > folks who want low latency or no memory overhead should simply
> > > > > disable compaction.
> > > > 
> > > > That strikes me as being somewhat heavy handed. Compaction should be as
> > > > low latency as possible.
> > > 
> > > Yes, I meant in the very short term. Optimizations are always
> > > possible :) we've just got to sort out some issues (as the previous part of
> > > the email discussed).
> > > 
> > 
> > To put some figures on the latency impacts, I added a test to MM
> > Tests similar to yours that is in three parts
> > 
> > 1. A process reads /dev/sda and writes to /dev/null
> 
> Yes also note, ironically this is likely to be a better test for this
> without the __GFP_MOVABLE in block_dev.c. Even if we want it fixed,
> maybe another source that reduces the non movable pages may be needed then.
> 

I'm also running other tests to avoid tuning for just this test case.
Right now, the list looks like:

1. postmark with something creating anonymous mappings in the background
2. plain USB writing while creating anonymous mappings
3. The same as described here, except instead of reading /dev/sda, it's
   reading a large file off a filesystem

Unfortunately I had to restart all the tests yesterday evening as the
USB stick got trashed (to be fair, I had severely abused it). That
set things back a bit.

> > 2. A process reads from /dev/zero and writes to a filesystem on a USB stick
> > 3. When memory is full, a process starts that creates an anonymous
> >    mapping
> 
> Great.
> 
> > Looking at the vfat figures, we see
> > 
> > THPAVAIL
> >                 thpavail-3.0.0    thpavail-3.1.0    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
> >                  3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
> > System Time         1.89 (    0.00%)    1.90 (   -0.21%)    3.83 (  -50.63%)   19.95 (  -90.52%)   71.91 (  -97.37%)
> > +/-                 0.15 (    0.00%)    0.05 (  205.45%)    2.39 (  -93.87%)   27.59 (  -99.47%)    0.75 (  -80.49%)
> > User Time           0.10 (    0.00%)    0.10 (   -4.00%)    0.10 (   -7.69%)    0.12 (  -17.24%)    0.11 (   -9.43%)
> > +/-                 0.02 (    0.00%)    0.02 (  -24.28%)    0.03 (  -38.31%)    0.02 (    0.00%)    0.04 (  -56.57%)
> > Elapsed Time      986.53 (    0.00%) 1189.77 (  -17.08%)  589.48 (   67.36%)  506.45 (   94.79%)   73.39 ( 1244.27%)
> > +/-                35.52 (    0.00%)   90.09 (  -60.57%)   49.00 (  -27.51%)  213.56 (  -83.37%)    0.47 ( 7420.98%)
> > THP Active        118.80 (    0.00%)   89.20 (  -33.18%)   35.40 ( -235.59%)    8.00 (-1385.00%)   44.00 ( -170.00%)
> > +/-                65.57 (    0.00%)   35.37 (  -85.38%)   29.23 ( -124.33%)   12.51 ( -424.28%)   19.83 ( -230.65%)
> > Fault Alloc       308.80 (    0.00%)  244.20 (  -26.45%)   59.40 ( -419.87%)   81.00 ( -281.23%)   95.80 ( -222.34%)
> > +/-                81.62 (    0.00%)   80.53 (   -1.36%)   26.85 ( -203.98%)   67.85 (  -20.30%)   34.29 ( -138.05%)
> > Fault Fallback    697.20 (    0.00%)  761.60 (   -8.46%)  946.40 (  -26.33%)  924.80 (  -24.61%)  909.80 (  -23.37%)
> > +/-                81.62 (    0.00%)   80.52 (    1.37%)   26.65 (  206.28%)   67.93 (   20.16%)   34.25 (  138.33%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)        541.58    645.79   1220.88   1131.04    976.54
> > Total Elapsed Time (seconds)               5066.05   6083.58   3068.08   2653.11    493.96
> 
> Notice how the fault alloc goes down from 308-244 to 59 in
> rc2-vanilla; that's the vmscan.c patch I backed out in my series...

Yes, although that backout is already nak'd due to its aggression. I
recognise the problem but I'm not considering the revert as a solution
at the moment. I know it helps THP allocation success rates but at
the cost of performance for other workloads if it reclaims too much.

> So
> that explains why I backed out that bit. OK, the elapsed time improved
> because reclaim is used less, but nothing close to 10x faster
> thanks to it, even for this workload and even on vfat, and if you add
> __GFP_MOVABLE to block_dev.c things will likely change for the better
> too.

Have you checked that those allocations are really movable? I haven't
and it's relatively low on my list of priorities unfortunately.

> (It was 10x slower if that led to swapping, likely not the
> case here, but again the __GFP_MOVABLE should take care of it)
> 
> > Straight off, 3.0, 3.1 and 3.2-rc2 have a number of THPs in use. 3.0
> > had a roughly 30% success rate with 10% of the mapping using THP but
> > look at the cost in time. It took about 16 minutes per iteration (5
> > iterations which is why total elapsed time looks large) to fault in a
> > mapping the size of physical memory versus 1.5 minutes with my series
> > (which is still very slow).
> 
> I'd be curious if you tested my exact status too.
> 

Not yet; these tests take time and the loss of the USB stick didn't
help, but I'll run it over the weekend.

> > In my series system time is stupidly high but I haven't run oprofile
> > to see exactly where. It could be because it's running compaction
> > because defer_compaction() is not being called at the right times,
> > but it could also be because direct reclaim scanning is through the
> > roof, albeit better than 3.2-rc2 vanilla. I suspect the direct reclaim
> > scanning is high because we are not calling ->writepage in direct
> > reclaim any more - 3.0 wrote 178215 pages while my series wrote 5.
> > 
> > The story looks better for ext4 as it is not writing back pages in
> > fallback_migrate_page
> 
> That certainly has an effect on the USB stick as they're all vfat.
> 

I thought initially that the stupidly high scanning was due to large
numbers of pages under writeback, but the counts are too low for that
to be the problem. The immediate reclaim logic for PageReclaim pages
appears to be working. It is also getting throttled, although it is woken
quickly by the faster device. In itself that is not a problem, but it is
the next direction I'm looking in.
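
For anyone following along, the logic I mean is roughly the following
(paraphrased rather than the exact vmscan.c hunk): dirty file pages
that direct reclaim is no longer allowed to write are tagged
PageReclaim and kept on the LRU, so that end_page_writeback() rotates
them to the tail of the inactive list once the flusher threads clean
them, where they are then reclaimed almost immediately.

	if (PageDirty(page) && page_is_file_cache(page) &&
	    !current_is_kswapd()) {
		SetPageReclaim(page);
		goto keep_locked;
	}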

> > THPAVAIL
> >                 thpavail-3.0.0    thpavail-3.1.0    thpavail-3.2.0         3.2.0-rc2         3.2.0-rc2
> >                  3.0.0-vanilla     3.1.0-vanilla       rc2-vanilla migratedirty-v4r2    synclight-v4r2
> > System Time         2.64 (    0.00%)    2.49 (    6.10%)    3.09 (  -14.38%)    4.29 (  -38.40%)   12.20 (  -78.33%)
> > +/-                 0.29 (    0.00%)    0.61 (  -51.80%)    1.17 (  -74.90%)    1.33 (  -78.04%)   16.27 (  -98.20%)
> > User Time           0.12 (    0.00%)    0.08 (   51.28%)    0.10 (   13.46%)    0.10 (   13.46%)    0.09 (   34.09%)
> > +/-                 0.03 (    0.00%)    0.02 (    6.30%)    0.01 (   94.49%)    0.01 (  158.69%)    0.03 (    0.00%)
> > Elapsed Time      107.76 (    0.00%)   64.42 (   67.29%)   60.89 (   76.99%)   30.90 (  248.69%)   50.72 (  112.45%)
> > +/-                39.83 (    0.00%)   17.03 (  133.90%)    8.75 (  355.08%)   17.20 (  131.65%)   22.87 (   74.14%)
> > THP Active         82.80 (    0.00%)   84.20 (    1.66%)   35.00 ( -136.57%)   53.00 (  -56.23%)   36.40 ( -127.47%)
> > +/-                81.42 (    0.00%)   80.82 (   -0.74%)   20.89 ( -289.73%)   34.50 ( -136.01%)   51.51 (  -58.07%)
> > Fault Alloc       246.00 (    0.00%)  292.60 (   15.93%)   66.20 ( -271.60%)  129.40 (  -90.11%)   90.20 ( -172.73%)
> > +/-               173.64 (    0.00%)  161.13 (   -7.76%)   21.30 ( -715.14%)   52.41 ( -231.32%)  104.14 (  -66.74%)
> > Fault Fallback    759.60 (    0.00%)  712.40 (    6.63%)  939.80 (  -19.17%)  876.20 (  -13.31%)  915.60 (  -17.04%)
> > +/-               173.49 (    0.00%)  161.13 (    7.67%)   21.30 (  714.44%)   52.16 (  232.59%)  104.00 (   66.82%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds)         63.41     50.64     39.79      58.8       171
> > Total Elapsed Time (seconds)                732.79    483.23    447.17    335.28    412.21
> 
> Here again fault alloc goes down from 246-292 to 66 going from 3.0/3.1 to
> rc2. And elapsed time with ext4 is the same for 3.1.0 and rc2 (thanks
> to ext4 instead of vfat). So again no big difference in elapsed time
> from the vmscan.c backout.
> 
> I think we can keep a limit on how deep reclaim can go (maybe the same
> percentage logic we use in kswapd for the watermarks), but what was
> implemented is too strict, and just skipping reclaim when compaction
> fails doesn't sound OK.

Neither is reclaiming way too much memory. I think the answer there
is limiting how deep reclaim goes, but I have not actually gotten around
to implementing anything yet. I've been distracted unfortunately.

> At least not until compaction gets better at
> compacting memory even when little RAM is free. Compaction
> can't free memory; it takes what is free and creates "contiguous free
> pages" from "fragmented free pages". So the more free memory, the
> higher the chance of compaction success.

Fully agreed, and this is part of the reason why min_free_kbytes is
higher when THP is enabled - the other part being that it helps fragmentation
avoidance.
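
For reference, khugepaged bumps it along these lines when THP is
enabled (paraphrased from memory, so treat the exact factors as
approximate rather than as a quote of the code):

	/* nr_zones is the number of populated zones */
	recommended_min = pageblock_nr_pages * nr_zones * 2;
	/* headroom so fragmentation avoidance can keep pageblocks pure */
	recommended_min += pageblock_nr_pages * nr_zones *
			   MIGRATE_PCPTYPES * MIGRATE_PCPTYPES;
	recommended_min = min(recommended_min,
			      (unsigned long) nr_free_buffer_pages() / 20);
	recommended_min <<= (PAGE_SHIFT - 10);	/* pages to kilobytes */
	if (recommended_min > min_free_kbytes)
		min_free_kbytes = recommended_min;
	setup_per_zone_wmarks();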

> This is why not stopping reclaim
> (if compaction fails) is good and increases the success rate
> significantly (tripling it in the above load) without actually slowing
> down the elapsed time at all, as shown above.
> 
> > THP availability is very variable between iterations and the stall
> > times are nowhere near as bad as with vfat but still, my series cuts
> > the time to fault in a mapping by half (and that is still very slow).
> 
> The results you got in compaction success for the first three columns
> are similar to what I get. Of course this is an extreme load with a flood
> of allocations and tons of dirty RAM, so it's normal that the success rate
> of compaction is significantly lower than it would be in a light-VM
> condition. But if reclaim is completely stopped by the vmscan patch
> then even under a light-VM condition the success rate of THP allocation
> goes down.
> 

Fully understood and agreed.

> > If you want, the __GFP_NO_KSWAPD check patch can be dropped as that
> > will also keep David happy until this series is fully baked if we
> > agree in principle that the current stalls are unacceptably large. It
> > will be a stretch to complete it in time for 3.2 though meaning that
> > stalls in 3.2 will be severe with THP enabled.
> 
> Well there's no risk that it will be worse than 3.1 at least.

Other than the continued risk of hearing complaints about stalling, of
course.

> Also
> with a more powerful async compaction the chance we end up in sync
> compaction diminishes.
> 

That is a large part of the motivation for this series. Keeping sync
compaction may be necessary (which is why my newer series keeps it)
but we can reduce how often it is used.
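
The hunk that keeps THP allocations away from sync compaction boils
down to something like this in the allocator slow path (from memory,
so treat it as the direction rather than the exact diff):

	/*
	 * THP allocations pass __GFP_NO_KSWAPD, so they are restricted
	 * to async compaction; other high-order allocations can still
	 * retry with sync compaction.
	 */
	sync_migration = !(gfp_mask & __GFP_NO_KSWAPD);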

> > > <SNIP>
> > > When compaction_suitable is happy about the
> > > wmarks, the compaction loop still isn't as reliable as it could be in
> > > the movable zone, and doing more reclaim helps a lot; the more free
> > > pages, the more compaction goes ahead and has a chance to compact more
> > > stuff before the two scans meet in the middle. With Rik we thought
> > > about a way to do just a no-swapping reclaim to shrink caches like
> > > slab.
> > > 
> > 
> > Ok, is there a patch? I'll investigate this later when I get the time
> > but in the shorter term, I'll be looking at why the direct reclaim
> > scan figures are so high.
> 
> No patch for this one, just my quick backout of the vmscan.c part
> and your benchmark above explains why I did it.
> 

And it's helpful for illustration but a revert is not going to fly.

> > In retrospect, the main difference may be between raw access to the
> > device and when the block device backs a filesystem that is currently
> > mounted. While mounted, the filesystem may pin pages for metadata
> > access which would prevent them ever being migrated and make them
> > unsuitable for use with __GFP_MOVABLE.
> 
> The fs will look up the bh in the hash and get access to the
> pagecache from there (to avoid aliasing between the blkdev pagecache
> and the bh, so when you write to /dev/sda on a mounted fs it's immediately
> seen by the filesystem and the writes from pagecache won't get
> discarded when the fs flushes its dirty buffers). The pin you mention
> should be on the bh, like the superblock buffers.
> 
> But funny thing, grow_dev_page already sets __GFP_MOVABLE. That's
> pretty weird and it's probably a source of a few non-movable pages in
> the movable blocks. But then many bh are movable... most of them are;
> it's just the superblock that isn't.
> 
> But considering grow_dev_page sets __GFP_MOVABLE, pins from the fs on
> the block_dev.c pagecache shouldn't be a concern...
> 

Except in quantity. We can cope with some pollution of MIGRATE_MOVABLE
but if it gets excessive, it will cause a lot of trouble. Superblock
bh's may not be movable but there are not many of them and they are
long lived.

> > In Rik's case, this might be less of an issue. It's using the
> > def_blk_aops as the address space operations that does not have a
> > ->migratepage handler so minimally if they are dirty they have to be
> > written out meaning it will behave more like vfat than ext4 in terms
> > of compaction performance. I did not spot anything that would actually
> > pin the buffers and prevent them from being released though but would
> > prefer it was double checked.
> 
> Double checking is a good idea.
> 
> The reason Rik's case can be improved greatly by the __GFP_MOVABLE in
> block_dev.c is that when there's very little movable memory, it's
> less likely we compact stuff. All the pagecache (it actually shows up as
> buffer cache in free, but it's pagecache) goes into the non-movable
> pageblocks, so those get polluted by a mixture of slab and other stuff. So the
> vast majority of memory becomes non-compactable. Compaction works
> better if all RAM is movable, of course. With that missing
> __GFP_MOVABLE, async migration would always fail and invoke
> reclaim far more than it would have on a real fs. And that
> explains his swap storms, I think. I haven't reproduced it but that's my
> theory at least, and it should be tested.
> 

Testing to confirm will be necessary, as well as checking whether the pages
are really movable and what happens when the block device backs
a mounted filesystem.

> > If you're going to test a series, can you look at the V4 I posted that
> > attempts to reconcile our two series?
> 
> I had to look into other things, sorry. I hope to get around to testing this
> too. I actually played with some other change in the compaction scan
> which is more of an experiment and not related to the above, but I've nothing
> conclusive yet.
> 

Understood.

> Also, what we have in 3.0/3.1 works _very_ well; there is no urgency to
> make too many changes which haven't been benchmarked/tested much yet
> and that will improve latency but reduce the allocation rate. If the
> allocations are very long lived, giving a better chance to allocate
> those pages is good even if there's write I/O. vfat + slow storage is
> a bit of a pathological case; it doesn't actually apply to all
> hardware/fs combinations.

True, but it's an interesting case at the same time. I reckon if we
can make it work well, it'll improve a number of other stall-related
issues as well.

> __GFP_MOVABLE missing from block_dev also was not
> so common, and it most certainly contributed to reclaim being more
> aggressive than it would have been with that fix. I think you can
> push things one at a time without urgency here, and I'd prefer it if the
> block_dev patch is applied and the other reverted in vmscan.c, or
> improved to start limiting only if we're above 8*high or some
> percentage check, to allow a little more reclaim than rc2 allows

The limiting is my current preferred option - at least until it is
confirmed that it really is ok to mark block_dev pages movable and that
Rik is ok with the revert.

> (i.e. no reclaim at all, which likely results in a failure in hugepage
> allocation). Not being unlimited like 3.1 is OK with me, but if kswapd can free
> a percentage I don't see why reclaim can't (considering more free
> pages in movable pageblocks are needed for compaction to succeed). The
> ideal is to improve the compaction rate and at the same time reduce
> reclaim aggressiveness. Let's start with the parts that are more
> obviously right fixes and that don't risk regressions; we don't want
> compaction regressions :).
> 

I don't think there are any "obviously right fixes" right now until the
block_dev patch is proven to be OK and it is shown that reverting does not
regress Rik's workload. That is going to take time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage
  2011-11-24 12:21             ` Mel Gorman
@ 2011-11-26  6:51               ` Andy Isaacson
  2011-11-27 20:50               ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Andy Isaacson @ 2011-11-26  6:51 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Linux-MM, Minchan Kim, Jan Kara, Johannes Weiner, LKML

On Thu, Nov 24, 2011 at 12:21:44PM +0000, Mel Gorman wrote:
> On Thu, Nov 24, 2011 at 02:19:43AM +0100, Andrea Arcangeli wrote:
> > Yes also note, ironically this is likely to be a better test for this
> > without the __GFP_MOVABLE in block_dev.c. Even if we want it fixed,
> > maybe another source that reduces the non movable pages may be needed then.
> > 
> 
> I'm also running other tests to avoid tuning for just this test case.
> Right now, the list looks like:
> 
> 1. postmark with something creating anonymous mappings in the background
> 2. plain USB writing while creating anonymous mappings

I've been testing the original case that started this thread -- writing
multiple GB to a USB-attached, FAT-formatted, very slow SD card.

I'm currently running 7f80850d plus Mel's original patch, "mm: Do not
stall in synchronous compaction for THP allocations".  With this patch I
cannot reproduce the hangs that I saw.  I haven't retried without the
patch to confirm that they're reproducible, though.

Someone asked about CONFIG_NUMA; I have CONFIG_NUMA=y.

I can reboot this weekend; what patches should I test with next?

-andy

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage
  2011-11-24 12:21             ` Mel Gorman
  2011-11-26  6:51               ` Andy Isaacson
@ 2011-11-27 20:50               ` Rik van Riel
  1 sibling, 0 replies; 40+ messages in thread
From: Rik van Riel @ 2011-11-27 20:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Andrea Arcangeli, Linux-MM, Minchan Kim, Jan Kara, Andy Isaacson,
	Johannes Weiner, LKML

On 11/24/2011 07:21 AM, Mel Gorman wrote:
> On Thu, Nov 24, 2011 at 02:19:43AM +0100, Andrea Arcangeli wrote:

>> But funny thing, grow_dev_page already sets __GFP_MOVABLE. That's
>> pretty weird and it's probably a source of a few non-movable pages in
>> the movable blocks. But then many bh are movable... most of them are;
>> it's just the superblock that isn't.
>>
>> But considering grow_dev_page sets __GFP_MOVABLE, pins from the fs on
>> the block_dev.c pagecache shouldn't be a concern...
>>
>
> Except in quantity. We can cope with some pollution of MIGRATE_MOVABLE
> but if it gets excessive, it will cause a lot of trouble. Superblock
> bh's may not be movable but there are not many of them and they are
> long lived.

We're potentially doomed either way :)

If we allocate a lot of movable pages in non-movable
blocks, we can end up with a lot of slightly polluted
blocks even after reclaiming all the reclaimable page
cache.

If we allocate a few non-movable pages in movable
blocks, we can end up with the same situation.

Either way, we can potentially end up with a lot of
memory that cannot be defragmented.

Of course, it could take the mounting of a lot of
filesystems for this problem to be triggered, but we
know there are people doing that.

>> __GFP_MOVABLE missing from block_dev also was not
>> so common, and it most certainly contributed to reclaim being more
>> aggressive than it would have been with that fix. I think you can
>> push things one at a time without urgency here, and I'd prefer it if the
>> block_dev patch is applied and the other reverted in vmscan.c, or
>> improved to start limiting only if we're above 8*high or some
>> percentage check, to allow a little more reclaim than rc2 allows
>
> The limiting is my current preferred option - at least until it is
> confirmed that it really is ok to mark block_dev pages movable and that
> Rik is ok with the revert.

I am fine with replacing the compaction checks with free limit
checks. Funny enough, the first iteration of the patch I submitted
to limit reclaim used a free limit check :)

I also suspect we will want to call shrink_slab regardless of
whether or not a memory zone is already over its free limit for
direct reclaim, since that has the potential to free an otherwise
unmovable page.
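
Something like this is what I have in mind for the direct reclaim
path - a sketch only, over_free_limit() is a made-up helper standing
in for whatever limit check we settle on:

	/* skip LRU scanning once the zone is over the free limit... */
	if (!over_free_limit(zone, sc->order))
		shrink_zone(priority, zone, sc);

	/*
	 * ...but always give the slab caches a chance to shrink, since
	 * a freed slab object can release an otherwise unmovable page
	 * in a pageblock compaction cares about.
	 */
	shrink_slab(&shrink, sc->nr_scanned, lru_pages);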

>> (i.e. no reclaim at all, which likely results in a failure in hugepage
>> allocation). Not being unlimited like 3.1 is OK with me, but if kswapd can free
>> a percentage I don't see why reclaim can't (considering more free
>> pages in movable pageblocks are needed for compaction to succeed). The
>> ideal is to improve the compaction rate and at the same time reduce
>> reclaim aggressiveness. Let's start with the parts that are more
>> obviously right fixes and that don't risk regressions; we don't want
>> compaction regressions :).
>>
>
> I don't think there are any "obviously right fixes" right now until the
> block_dev patch is proven to be OK and it is shown that reverting does not
> regress Rik's workload. That is going to take time.

Ironically, the test Andrea is measuring THP allocations with
(dd from /dev/sda to /dev/null) is functionally equivalent to
me running KVM guests with cache=writethrough directly from
a block device.

The difference is that Andrea is measuring THP allocation
success rate, while I am watching how well the programs (and
KVM guests) actually run.

Not surprisingly, swapping out the working set has a pretty
catastrophic effect on performance, even if it helps THP
allocation success :)

-- 
All rights reversed

^ permalink raw reply	[flat|nested] 40+ messages in thread

end of thread, other threads:[~2011-11-27 20:50 UTC | newest]

Thread overview: 40+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2011-11-18 16:58 [RFC PATCH 0/5] Reduce compaction-related stalls and improve asynchronous migration of dirty pages v3 Mel Gorman
2011-11-18 16:58 ` [PATCH 1/5] mm: compaction: Allow compaction to isolate dirty pages Mel Gorman
2011-11-18 17:28   ` Andrea Arcangeli
2011-11-21 17:16   ` Rik van Riel
2011-11-18 16:58 ` [PATCH 2/5] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Mel Gorman
2011-11-18 17:27   ` Andrea Arcangeli
2011-11-21 21:46   ` Rik van Riel
2011-11-18 16:58 ` [PATCH 3/5] mm: Do not stall in synchronous compaction for THP allocations Mel Gorman
2011-11-18 17:34   ` Andrea Arcangeli
2011-11-18 16:58 ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage Mel Gorman
2011-11-18 21:35   ` Andrea Arcangeli
2011-11-21 11:17     ` Mel Gorman
2011-11-21 22:45       ` Andrea Arcangeli
2011-11-22  0:55         ` [PATCH] mm: compaction: make buffer cache __GFP_MOVABLE Rik van Riel
2011-11-22 12:59         ` [PATCH 4/5] mm: compaction: Determine if dirty pages can be migreated without blocking within ->migratepage Mel Gorman
2011-11-24  1:19           ` Andrea Arcangeli
2011-11-24 12:21             ` Mel Gorman
2011-11-26  6:51               ` Andy Isaacson
2011-11-27 20:50               ` Rik van Riel
2011-11-19  8:59   ` Nai Xia
2011-11-19  9:48     ` Nai Xia
2011-11-21 11:19     ` Mel Gorman
2011-11-18 16:58 ` [PATCH 5/5] mm: compaction: make isolate_lru_page() filter-aware again Mel Gorman
2011-11-19 19:54 ` [RFC PATCH 0/5] Reduce compaction-related stalls Andrea Arcangeli
2011-11-19 19:54 ` [PATCH 1/8] mm: compaction: Allow compaction to isolate dirty pages Andrea Arcangeli
2011-11-19 19:54 ` [PATCH 2/8] mm: compaction: Use synchronous compaction for /proc/sys/vm/compact_memory Andrea Arcangeli
2011-11-19 19:54 ` [PATCH 3/8] mm: check if we isolated a compound page during lumpy scan Andrea Arcangeli
2011-11-21 11:51   ` Mel Gorman
2011-11-19 19:54 ` [PATCH 4/8] mm: compaction: defer compaction only with sync_migration Andrea Arcangeli
2011-11-21 12:36   ` Mel Gorman
2011-11-19 19:54 ` [PATCH 5/8] mm: compaction: avoid overwork in migrate sync mode Andrea Arcangeli
2011-11-21 21:59   ` Rik van Riel
2011-11-22  9:51     ` Mel Gorman
2011-11-19 19:54 ` [PATCH 6/8] Revert "mm: compaction: make isolate_lru_page() filter-aware" Andrea Arcangeli
2011-11-21 12:57   ` Mel Gorman
2011-11-19 19:54 ` [PATCH 7/8] Revert "vmscan: abort reclaim/compaction if compaction can proceed" Andrea Arcangeli
2011-11-21 13:09   ` Mel Gorman
2011-11-21 15:37     ` Rik van Riel
2011-11-19 19:54 ` [PATCH 8/8] Revert "vmscan: limit direct reclaim for higher order allocations" Andrea Arcangeli
2011-11-21 21:57   ` Rik van Riel
