* [PATCH v3 00/17] make direct compaction more deterministic
@ 2016-06-24  9:54 Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 01/17] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
                   ` (16 more replies)
  0 siblings, 17 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

Changes since v2:

* Rebase on mmotm-2016-06-15-16-18 with Mel's node-based reclaim series dropped
  locally. Note there will be some small conflicts, but nothing substantial
  that should complicate re-adding Mel's series later or invalidate the testing.
* Dropped patch 18, which was the only major conflict with Mel's series (it
  solves the same thing) and wasn't that important to this series.
* The rebasing, however, required a major rewrite of patches 2 and 3 due to
  changes in mmotm, so I dropped the acks. Changes there should also address
  reviewers' concerns. E.g. ALLOC_NO_WATERMARKS is used in direct reclaim and
  compaction attempts after patch 3, as Joonsoo suggested.
* In patch 12, compaction retries are now only counted after reaching the final
  priority (suggested by Michal Hocko).

Changes since v1 RFC:

* Incorporate feedback from Michal, Joonsoo, Tetsuo
* Expanded cleanup of watermark checks controlling reclaim/compaction

This is mostly a followup to Michal's oom detection rework, which highlighted
the need for direct compaction to provide better feedback in the
reclaim/compaction loop, so that it can reliably recognize when compaction
cannot make further progress and the allocation should invoke the OOM killer or
fail. We've discussed this at LSF/MM [1], where I proposed expanding the
async/sync migration mode used in compaction to more general "priorities".
This patchset adds one new
priority that just overrides all the heuristics and makes compaction fully
scan all zones. I don't currently think that we need more fine-grained
priorities, but we'll see. Other than that, there are some smaller fixes and
cleanups, mainly related to the THP-specific hacks.
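
To illustrate the escalation scheme, here's a minimal userspace sketch; the
enum values mirror what patch 07 adds, while the surrounding retry loop and
the try_compact() stand-in are invented purely for this example:

#include <stdbool.h>
#include <stdio.h>

/* Mirrors the enum added by patch 07: lower value means higher priority,
 * analogous to reclaim priority. */
enum compact_priority {
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC,
};

/* Invented stand-in for a direct compaction attempt that keeps failing. */
static bool try_compact(enum compact_priority prio)
{
	printf("compacting at priority %d\n", prio);
	return false;
}

int main(void)
{
	enum compact_priority prio = INIT_COMPACT_PRIORITY;

	/* Each unsuccessful round raises the priority (decrements the value),
	 * as should_compact_retry() does after patch 07, until the minimum
	 * priority is reached and we give up for good. */
	while (!try_compact(prio)) {
		if (prio == MIN_COMPACT_PRIORITY)
			break;
		prio--;
	}
	return 0;
}

(The real escalation lives in should_compact_retry(); patch 12 additionally
makes retries only count once the final priority has been reached.)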

I've tested this with stress-highalloc in GFP_KERNEL order-4 and
GFP_HIGHUSER_MOVABLE order-9 scenarios. There's not much to report but noise,
except for reductions in direct reclaim (left column is the baseline, right
column is with this series applied):

order-9:

Direct pages scanned                238949       41502
Kswapd pages scanned               2069710     2229295
Kswapd pages reclaimed             1981047     2139089
Direct pages reclaimed              236534       41502

order-4:

Direct pages scanned                204214      110733
Kswapd pages scanned               2125221     2179180
Kswapd pages reclaimed             2027102     2098257
Direct pages reclaimed              194942      110695

Also Patch 1 describes reductions in page migration failures.

[1] https://lwn.net/Articles/684611/

Hugh Dickins (1):
  mm, compaction: don't isolate PageWriteback pages in
    MIGRATE_SYNC_LIGHT mode

Vlastimil Babka (16):
  mm, page_alloc: set alloc_flags only once in slowpath
  mm, page_alloc: don't retry initial attempt in slowpath
  mm, page_alloc: restructure direct compaction handling in slowpath
  mm, page_alloc: make THP-specific decisions more generic
  mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  mm, compaction: introduce direct compaction priority
  mm, compaction: simplify contended compaction handling
  mm, compaction: make whole_zone flag ignore cached scanner positions
  mm, compaction: cleanup unused functions
  mm, compaction: add the ultimate direct compaction priority
  mm, compaction: more reliably increase direct compaction priority
  mm, compaction: use correct watermark when checking allocation success
  mm, compaction: create compact_gap wrapper
  mm, compaction: use proper alloc_flags in __compaction_suitable()
  mm, compaction: require only min watermarks for non-costly orders
  mm, vmscan: make compaction_ready() more accurate and readable

 include/linux/compaction.h        |  84 ++++++-------
 include/linux/gfp.h               |  14 ++-
 include/trace/events/compaction.h |  12 +-
 include/trace/events/mmflags.h    |   1 +
 mm/compaction.c                   | 186 +++++++++------------------
 mm/huge_memory.c                  |  29 +++--
 mm/internal.h                     |   7 +-
 mm/khugepaged.c                   |   2 +-
 mm/migrate.c                      |   2 +-
 mm/page_alloc.c                   | 258 ++++++++++++++++++--------------------
 mm/vmscan.c                       |  47 ++++---
 tools/perf/builtin-kmem.c         |   1 +
 12 files changed, 281 insertions(+), 362 deletions(-)

-- 
2.8.4

* [PATCH v3 01/17] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Hugh Dickins, Vlastimil Babka

From: Hugh Dickins <hughd@google.com>

At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects
with -EBUSY: of course the writeback might complete in between, but
that's not what we usually expect, so probably better not to isolate it.

When tested by stress-highalloc from mmtests, this has reduced the number of
page migrate failures by 60-70%.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c611b8a42023..b7b696e46eaa 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1204,7 +1204,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	struct page *page;
 	const isolate_mode_t isolate_mode =
 		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
-		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
 
 	/*
 	 * Start at where we last stopped, or beginning of the zone as
-- 
2.8.4

* [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 01/17] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-30 14:44   ` Michal Hocko
  2016-06-24  9:54 ` [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
so move the initialization above the retry: label. Also make the comment above
the initialization more descriptive.

The only exception to alloc_flags being constant is ALLOC_NO_WATERMARKS,
which may change due to TIF_MEMDIE being set on the allocating thread. We can
fix this, and make the code simpler and a bit more efficient at the same time,
by moving the part that determines ALLOC_NO_WATERMARKS from
gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
mask out ALLOC_NO_WATERMARKS in numerous places in __alloc_pages_slowpath()
anymore. The only two tests for the flag can instead call
gfp_pfmemalloc_allowed().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 52 ++++++++++++++++++++++++++--------------------------
 1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 89128d64d662..82545274adbe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3193,8 +3193,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTSTALL);
 
-	page = get_page_from_freelist(gfp_mask, order,
-					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
 	if (page) {
 		struct zone *zone = page_zone(page);
@@ -3362,8 +3361,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 
 retry:
-	page = get_page_from_freelist(gfp_mask, order,
-					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -3421,16 +3419,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (gfp_mask & __GFP_MEMALLOC)
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -3440,7 +3428,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return false;
+
+	if (gfp_mask & __GFP_MEMALLOC)
+		return true;
+	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+		return true;
+	if (!in_interrupt() &&
+			((current->flags & PF_MEMALLOC) ||
+			 unlikely(test_thread_flag(TIF_MEMDIE))))
+		return true;
+
+	return false;
 }
 
 static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
@@ -3575,36 +3575,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
-retry:
-	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
-		wake_all_kswapds(order, ac);
-
 	/*
-	 * OK, we're below the kswapd watermark and have kicked background
-	 * reclaim. Now things get more complex, so set up alloc_flags according
-	 * to how we want to proceed.
+	 * The fast path uses conservative alloc_flags to succeed only until
+	 * kswapd needs to be woken up, and to avoid the cost of setting up
+	 * alloc_flags precisely. So we do that now.
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+retry:
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		wake_all_kswapds(order, ac);
+
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if ((alloc_flags & ALLOC_NO_WATERMARKS) || !(alloc_flags & ALLOC_CPUSET)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || gfp_pfmemalloc_allowed(gfp_mask)) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
 	}
 
 	/* This is the last chance, in general, before the goto nopage. */
-	page = get_page_from_freelist(gfp_mask, order,
-				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
-	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+	if (gfp_pfmemalloc_allowed(gfp_mask)) {
+
 		page = get_page_from_freelist(gfp_mask, order,
 						ALLOC_NO_WATERMARKS, ac);
 		if (page)
-- 
2.8.4

* [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt in slowpath
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 01/17] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-30 15:03   ` Michal Hocko
  2016-06-24  9:54 ` [PATCH v3 04/17] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
first tries get_page_from_freelist() with the new alloc_flags, as it may
succeed e.g. due to using the min watermark instead of the low watermark. It
makes sense to do this attempt before adjusting the zonelist based on
alloc_flags/gfp_mask, as it's still a relatively fast path if we just wake up
kswapd and successfully allocate.

This patch therefore moves the initial attempt above the retry label and
reorganizes the part below the retry label a bit. We still have to attempt
get_page_from_freelist() on each retry, as some allocations cannot do that
as part of direct reclaim or compaction, and yet are not allowed to fail
(even though they do a WARN_ON_ONCE() and thus should not exist). We can reuse
the call meant for ALLOC_NO_WATERMARKS attempt and just set alloc_flags to
ALLOC_NO_WATERMARKS if the context allows it. As a side-effect, the attempts
from direct reclaim/compaction will also no longer obey watermarks once this
is set, but there's little harm in that.

Kswapd wakeups are also done on each retry to be safe from potential races
resulting in kswapd going to sleep while a process (that may not be able to
reclaim by itself) is still looping.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 82545274adbe..06cfa4bb807d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3582,35 +3582,42 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		wake_all_kswapds(order, ac);
+
+	/*
+	 * The adjusted alloc_flags might result in immediate success, so try
+	 * that first
+	 */
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+	if (page)
+		goto got_pg;
+
+
 retry:
+	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
+	if (gfp_pfmemalloc_allowed(gfp_mask))
+		alloc_flags = ALLOC_NO_WATERMARKS;
+
 	/*
 	 * Reset the zonelist iterators if memory policies can be ignored.
 	 * These allocations are high priority and system rather than user
 	 * orientated.
 	 */
-	if (!(alloc_flags & ALLOC_CPUSET) || gfp_pfmemalloc_allowed(gfp_mask)) {
+	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
 		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
 		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
 					ac->high_zoneidx, ac->nodemask);
 	}
 
-	/* This is the last chance, in general, before the goto nopage. */
+	/* Attempt with potentially adjusted zonelist and alloc_flags */
 	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
 
-	/* Allocate without watermarks if the context allows */
-	if (gfp_pfmemalloc_allowed(gfp_mask)) {
-
-		page = get_page_from_freelist(gfp_mask, order,
-						ALLOC_NO_WATERMARKS, ac);
-		if (page)
-			goto got_pg;
-	}
-
 	/* Caller is not willing to reclaim, we can't balance anything */
 	if (!can_direct_reclaim) {
 		/*
-- 
2.8.4

* [PATCH v3 04/17] mm, page_alloc: restructure direct compaction handling in slowpath
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (2 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 05/17] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim
and compaction (and OOM), until the allocation either succeeds or returns
with failure. Success here is more probable when reclaim precedes compaction,
as certain watermarks have to be met for compaction to even try, and more free
pages increase the probability of compaction success. On the other hand,
starting with light async compaction (if the watermarks allow it) can be
more efficient, especially for smaller orders, if there's enough free memory
which is just fragmented.

Thus, the current code starts with compaction before reclaim, and to make sure
that the last reclaim is always followed by a final compaction, there's another
direct compaction call at the end of the loop. This makes the code hard to
follow and adds some duplicated handling of migration_mode decisions. It's also
somewhat inefficient that even if reclaim or compaction decides not to retry,
the final compaction is still attempted. Some gfp flag combinations also
shortcut these retry decisions by "goto noretry;", making it even harder to
follow.

This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are now
placed above the retry loop, and the "noretry" direct compaction is removed.

The initial compaction is additionally restricted only to costly orders, as we
can expect smaller orders to be held back by watermarks, and only larger orders
to suffer primarily from fragmentation. This better matches the checks in
reclaim's shrink_zones().

There are two other smaller functional changes. One is that the upgrade from
async migration to light sync migration will always occur after the initial
compaction. This is how it has been until the recent patch "mm, oom: protect
!costly allocations some more", which introduced upgrading the mode based on the
COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
made it even more special. It's better to return to the simpler handling for
now, as migration modes will be further modified later in the series.

The second change is that once both reclaim and compaction declare it's not
worth retrying the reclaim/compaction loop, there is no final compaction attempt.
As argued above, this is intentional. If that final compaction were to succeed,
it would be due to a wrong retry decision, or simply a race with somebody else
freeing memory for us.

The main outcome of this patch should be simpler code. Logically, the initial
compaction without reclaim is the exceptional case to the reclaim/compaction
scheme, but prior to the patch, it was the last loop iteration that was
exceptional. Now the code matches the logic better. The change also enables the
following patches.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 106 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 54 insertions(+), 52 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06cfa4bb807d..0a3a7a9dbdff 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3551,7 +3551,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum migrate_mode migration_mode = MIGRATE_ASYNC;
+	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
@@ -3593,6 +3593,49 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/*
+	 * For costly allocations, try direct compaction first, as it's likely
+	 * that we have enough base pages and don't need to reclaim.
+	 */
+	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
+		page = __alloc_pages_direct_compact(gfp_mask, order,
+						alloc_flags, ac,
+						MIGRATE_ASYNC,
+						&compact_result);
+		if (page)
+			goto got_pg;
+
+		/* Checks for THP-specific high-order allocations */
+		if (is_thp_gfp_mask(gfp_mask)) {
+			/*
+			 * If compaction is deferred for high-order allocations,
+			 * it is because sync compaction recently failed. If
+			 * this is the case and the caller requested a THP
+			 * allocation, we do not want to heavily disrupt the
+			 * system, so we fail the allocation instead of entering
+			 * direct reclaim.
+			 */
+			if (compact_result == COMPACT_DEFERRED)
+				goto nopage;
+
+			/*
+			 * Compaction is contended so rather back off than cause
+			 * excessive stalls.
+			 */
+			if (compact_result == COMPACT_CONTENDED)
+				goto nopage;
+
+			/*
+			 * It can become very expensive to allocate transparent
+			 * hugepages at fault, so use asynchronous memory
+			 * compaction for THP unless it is khugepaged trying to
+			 * collapse. All other requests should tolerate at
+			 * least light sync migration.
+			 */
+			if (!(current->flags & PF_KTHREAD))
+				migration_mode = MIGRATE_ASYNC;
+		}
+	}
 
 retry:
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
@@ -3647,55 +3690,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
-	/*
-	 * Try direct compaction. The first pass is asynchronous. Subsequent
-	 * attempts after direct reclaim are synchronous
-	 */
+
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
+							&did_some_progress);
+	if (page)
+		goto got_pg;
+
+	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
 					&compact_result);
 	if (page)
 		goto got_pg;
 
-	/* Checks for THP-specific high-order allocations */
-	if (is_thp_gfp_mask(gfp_mask)) {
-		/*
-		 * If compaction is deferred for high-order allocations, it is
-		 * because sync compaction recently failed. If this is the case
-		 * and the caller requested a THP allocation, we do not want
-		 * to heavily disrupt the system, so we fail the allocation
-		 * instead of entering direct reclaim.
-		 */
-		if (compact_result == COMPACT_DEFERRED)
-			goto nopage;
-
-		/*
-		 * Compaction is contended so rather back off than cause
-		 * excessive stalls.
-		 */
-		if(compact_result == COMPACT_CONTENDED)
-			goto nopage;
-	}
-
 	if (order && compaction_made_progress(compact_result))
 		compaction_retries++;
 
-	/* Try direct reclaim and then allocating */
-	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
-							&did_some_progress);
-	if (page)
-		goto got_pg;
-
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
-		goto noretry;
+		goto nopage;
 
 	/*
 	 * Do not retry costly high order allocations unless they are
 	 * __GFP_REPEAT
 	 */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
-		goto noretry;
+		goto nopage;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3734,25 +3755,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 	}
 
-noretry:
-	/*
-	 * High-order allocations do not necessarily loop after direct reclaim
-	 * and reclaim/compaction depends on compaction being called after
-	 * reclaim so call directly if necessary.
-	 * It can become very expensive to allocate transparent hugepages at
-	 * fault, so use asynchronous memory compaction for THP unless it is
-	 * khugepaged trying to collapse. All other requests should tolerate
-	 * at least light sync migration.
-	 */
-	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
-		migration_mode = MIGRATE_ASYNC;
-	else
-		migration_mode = MIGRATE_SYNC_LIGHT;
-	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
-					    ac, migration_mode,
-					    &compact_result);
-	if (page)
-		goto got_pg;
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
-- 
2.8.4

* [PATCH v3 05/17] mm, page_alloc: make THP-specific decisions more generic
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (3 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 04/17] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 06/17] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

Since THP allocations during page faults can be costly, extra decisions are
employed for them to avoid excessive reclaim and compaction, if the initial
compaction doesn't look promising. The detection has never been perfect as
there is no gfp flag specific to THP allocations. At this moment it checks the
whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other
users of such a combination exist, or would mind being treated the same way.
Extra care is also taken to separate allocations from khugepaged, where latency
doesn't matter that much.

It is however possible to distinguish these allocations in a simpler and more
reliable way. The key observation is that after the initial compaction followed
by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY
allocations and costly allocations without __GFP_REPEAT are declared as
failures:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_REPEAT
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                goto nopage;

This means we can further distinguish allocations that are costly order *and*
additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE
allocations do already fall into this category. This will also allow other
costly allocations with similar high-order benefit vs latency considerations to
use this semantic. Furthermore, we can distinguish THP allocations that should
try a bit harder (such as from khugepaged) by removing __GFP_NORETRY, as will
be done in the next patch.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/page_alloc.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0a3a7a9dbdff..246cba86b257 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3159,7 +3159,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-
 /*
  * Maximum number of compaction retries wit a progress before OOM
  * killer is consider as the only way to move forward.
@@ -3443,11 +3442,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return false;
 }
 
-static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
-{
-	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
-}
-
 /*
  * Maximum number of reclaim retries without any progress before OOM killer
  * is consider as the only way to move forward.
@@ -3605,8 +3599,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		/* Checks for THP-specific high-order allocations */
-		if (is_thp_gfp_mask(gfp_mask)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes THP page fault allocations
+		 */
+		if (gfp_mask & __GFP_NORETRY) {
 			/*
 			 * If compaction is deferred for high-order allocations,
 			 * it is because sync compaction recently failed. If
@@ -3626,11 +3623,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				goto nopage;
 
 			/*
-			 * It can become very expensive to allocate transparent
-			 * hugepages at fault, so use asynchronous memory
-			 * compaction for THP unless it is khugepaged trying to
-			 * collapse. All other requests should tolerate at
-			 * least light sync migration.
+			 * Looks like reclaim/compaction is worth trying, but
+			 * sync compaction could be very expensive, so keep
+			 * using async compaction, unless it's khugepaged
+			 * trying to collapse.
 			 */
 			if (!(current->flags & PF_KTHREAD))
 				migration_mode = MIGRATE_ASYNC;
-- 
2.8.4

* [PATCH v3 06/17] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (4 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 05/17] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 07/17] mm, compaction: introduce direct compaction priority Vlastimil Babka
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

After the previous patch, we can distinguish costly allocations that should be
really lightweight, such as THP page faults, with __GFP_NORETRY. This means we
don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can
also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to
try as hard as khugepaged, as the process has indicated that it benefits from
THPs and is willing to pay some initial latency costs.

We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding __GFP_NORETRY
or __GFP_KSWAPD_RECLAIM is done where needed.

The patch effectively changes the current GFP_TRANSHUGE users as follows:

* get_huge_zero_page() - the zero page lifetime should be relatively long and
  it's shared by multiple users, so it's worth spending some effort on it.
  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added. This also restores
  direct reclaim to this allocation, which was unintentionally removed by
  commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add
  a stall-free defrag option")

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not
  an issue. So if khugepaged "defrag" is enabled (the default), do reclaim
  via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the PF_KTHREAD check
  from page alloc.
  As a side-effect, khugepaged will now no longer check if the initial
  compaction was deferred or contended. This is OK, as khugepaged sleep times
  between collapse attempts are long enough to prevent noticeable disruption,
  so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already was masking out __GFP_RECLAIM,
  so just convert to GFP_TRANSHUGE_LIGHT which is equivalent.

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are
  now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY
  if direct reclaim/compaction is at all allowed (by default it's allowed only
  for madvised vma's). The rest is conversion to GFP_TRANSHUGE(_LIGHT).

[mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/gfp.h            | 14 ++++++++------
 include/trace/events/mmflags.h |  1 +
 mm/huge_memory.c               | 29 ++++++++++++++++-------------
 mm/khugepaged.c                |  2 +-
 mm/migrate.c                   |  2 +-
 mm/page_alloc.c                |  6 ++----
 tools/perf/builtin-kmem.c      |  1 +
 7 files changed, 30 insertions(+), 25 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index c29e9d347bc6..f8041f9de31e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -237,9 +237,11 @@ struct vm_area_struct;
  *   are expected to be movable via page reclaim or page migration. Typically,
  *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
  *
- * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
- *   that will fail quickly if memory is not available and will not wake
- *   kswapd on failure.
+ * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
+ *   compound allocations that will generally fail quickly if memory is not
+ *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
+ *   version does not attempt reclaim/compaction at all and is by default used
+ *   in page fault path, while the non-light is used by khugepaged.
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -254,9 +256,9 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
-			 ~__GFP_RECLAIM)
+#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
+#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..5a81ab48a2fb 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -11,6 +11,7 @@
 
 #define __def_gfpflag_names						\
 	{(unsigned long)GFP_TRANSHUGE,		"GFP_TRANSHUGE"},	\
+	{(unsigned long)GFP_TRANSHUGE_LIGHT,	"GFP_TRANSHUGE_LIGHT"}, \
 	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"},\
 	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
 	{(unsigned long)GFP_USER,		"GFP_USER"},		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6eff8b123a88..4c89970ef1d8 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -539,23 +539,26 @@ static int __do_huge_pmd_anonymous_page(struct fault_env *fe, struct page *page,
 }
 
 /*
- * If THP is set to always then directly reclaim/compact as necessary
- * If set to defer then do no reclaim and defer to khugepaged
+ * If THP defrag is set to always then directly reclaim/compact as necessary
+ * If set to defer then do only background reclaim/compact and defer to khugepaged
  * If set to madvise and the VMA is flagged then directly reclaim/compact
+ * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	gfp_t reclaim_flags = 0;
-
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
-	    (vma->vm_flags & VM_HUGEPAGE))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
-
-	return GFP_TRANSHUGE | reclaim_flags;
+	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
+
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+				&transparent_hugepage_flags) && vma_madvised)
+		return GFP_TRANSHUGE;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+						&transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+						&transparent_hugepage_flags))
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
+
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 93d5f87c00d5..555d860b9543 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -694,7 +694,7 @@ static bool khugepaged_scan_abort(int nid)
 /* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
 static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 {
-	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
 #ifdef CONFIG_NUMA
diff --git a/mm/migrate.c b/mm/migrate.c
index c7531ccf65f4..e3f933b08535 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1929,7 +1929,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 246cba86b257..8ea9d1be54e9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3625,11 +3625,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
-			 * using async compaction, unless it's khugepaged
-			 * trying to collapse.
+			 * using async compaction.
 			 */
-			if (!(current->flags & PF_KTHREAD))
-				migration_mode = MIGRATE_ASYNC;
+			migration_mode = MIGRATE_ASYNC;
 		}
 	}
 
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index c9cb3be47cff..0d98182dc159 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -608,6 +608,7 @@ static const struct {
 	const char *compact;
 } gfp_compact_table[] = {
 	{ "GFP_TRANSHUGE",		"THP" },
+	{ "GFP_TRANSHUGE_LIGHT",	"THL" },
 	{ "GFP_HIGHUSER_MOVABLE",	"HUM" },
 	{ "GFP_HIGHUSER",		"HU" },
 	{ "GFP_USER",			"U" },
-- 
2.8.4

* [PATCH v3 07/17] mm, compaction: introduce direct compaction priority
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (5 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 06/17] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24 11:39   ` kbuild test robot
  2016-06-24  9:54 ` [PATCH v3 08/17] mm, compaction: simplify contended compaction handling Vlastimil Babka
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

In the context of direct compaction, for some types of allocations we would
like the compaction to either succeed or definitely fail while trying as hard
as possible. The current async/sync_light migration mode is insufficient, as there
are heuristics such as caching scanner positions, marking pageblocks as
unsuitable or deferring compaction for a zone. At least the final compaction
attempt should be able to override these heuristics.

To communicate how hard compaction should try, we replace migration mode with
a new enum compact_priority and change the relevant function signatures. In
compact_zone_order(), where struct compact_control is constructed, the priority
is mapped to suitable control flags. This patch itself has no functional
change, as the current priority levels are mapped back to the same migration
modes as before. Expanding them will be done next.

Note that the !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as
the only caller exists under CONFIG_COMPACTION.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h        | 29 +++++++++++++++++++++++------
 include/trace/events/compaction.h | 12 ++++++------
 mm/compaction.c                   | 13 +++++++------
 mm/page_alloc.c                   | 28 ++++++++++++++--------------
 4 files changed, 50 insertions(+), 32 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 1a02dab16646..b470765ed9e6 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,18 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+/*
+ * Determines how hard direct compaction should try to succeed.
+ * Lower value means higher priority, analogically to reclaim priority.
+ */
+enum compact_priority {
+	COMPACT_PRIO_SYNC_LIGHT,
+	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+	COMPACT_PRIO_ASYNC,
+	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
+};
+
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
@@ -66,7 +78,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended);
+		enum compact_priority prio, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
@@ -151,12 +163,17 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
-			unsigned int order, int alloc_flags,
-			const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended)
+static inline int PageMovable(struct page *page)
+{
+	return 0;
+}
+static inline void __SetPageMovable(struct page *page,
+			struct address_space *mapping)
+{
+}
+
+static inline void __ClearPageMovable(struct page *page)
 {
-	return COMPACT_CONTINUE;
 }
 
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 36e2d6fb1360..c2ba402ab256 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -226,26 +226,26 @@ TRACE_EVENT(mm_compaction_try_to_compact_pages,
 	TP_PROTO(
 		int order,
 		gfp_t gfp_mask,
-		enum migrate_mode mode),
+		int prio),
 
-	TP_ARGS(order, gfp_mask, mode),
+	TP_ARGS(order, gfp_mask, prio),
 
 	TP_STRUCT__entry(
 		__field(int, order)
 		__field(gfp_t, gfp_mask)
-		__field(enum migrate_mode, mode)
+		__field(int, prio)
 	),
 
 	TP_fast_assign(
 		__entry->order = order;
 		__entry->gfp_mask = gfp_mask;
-		__entry->mode = mode;
+		__entry->prio = prio;
 	),
 
-	TP_printk("order=%d gfp_mask=0x%x mode=%d",
+	TP_printk("order=%d gfp_mask=0x%x priority=%d",
 		__entry->order,
 		__entry->gfp_mask,
-		(int)__entry->mode)
+		__entry->prio)
 );
 
 DECLARE_EVENT_CLASS(mm_compaction_suitable_template,
diff --git a/mm/compaction.c b/mm/compaction.c
index b7b696e46eaa..4ed4f3232d8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1630,7 +1630,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+		gfp_t gfp_mask, enum compact_priority prio, int *contended,
 		unsigned int alloc_flags, int classzone_idx)
 {
 	enum compact_result ret;
@@ -1640,7 +1640,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.order = order,
 		.gfp_mask = gfp_mask,
 		.zone = zone,
-		.mode = mode,
+		.mode = (prio == COMPACT_PRIO_ASYNC) ?
+					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
@@ -1673,7 +1674,7 @@ int sysctl_extfrag_threshold = 500;
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended)
+		enum compact_priority prio, int *contended)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
@@ -1688,7 +1689,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	if (!order || !may_enter_fs || !may_perform_io)
 		return COMPACT_SKIPPED;
 
-	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
+	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
 
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
@@ -1701,7 +1702,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			continue;
 		}
 
-		status = compact_zone_order(zone, order, gfp_mask, mode,
+		status = compact_zone_order(zone, order, gfp_mask, prio,
 				&zone_contended, alloc_flags,
 				ac_classzone_idx(ac));
 		rc = max(status, rc);
@@ -1735,7 +1736,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			goto break_loop;
 		}
 
-		if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
 					status == COMPACT_PARTIAL_SKIPPED)) {
 			/*
 			 * We think that allocation won't succeed in this zone
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ea9d1be54e9..fc0f2a3d4e5c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3170,7 +3170,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
 	int contended_compaction;
@@ -3180,7 +3180,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						mode, &contended_compaction);
+						prio, &contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*compact_result <= COMPACT_INACTIVE)
@@ -3234,7 +3234,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
+		     enum compact_result compact_result,
+		     enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	int max_retries = MAX_COMPACT_RETRIES;
@@ -3245,11 +3246,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	/*
 	 * compaction considers all the zone as desperately out of memory
 	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by weak migration mode.
+	 * failure could be caused by insufficient priority
 	 */
 	if (compaction_failed(compact_result)) {
-		if (*migrate_mode == MIGRATE_ASYNC) {
-			*migrate_mode = MIGRATE_SYNC_LIGHT;
+		if (*compact_priority > MIN_COMPACT_PRIORITY) {
+			(*compact_priority)--;
 			return true;
 		}
 		return false;
@@ -3283,7 +3284,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	*compact_result = COMPACT_SKIPPED;
 	return NULL;
@@ -3292,7 +3293,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
-		     enum migrate_mode *migrate_mode,
+		     enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	struct zone *zone;
@@ -3545,7 +3546,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
+	enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
@@ -3594,7 +3595,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
 		page = __alloc_pages_direct_compact(gfp_mask, order,
 						alloc_flags, ac,
-						MIGRATE_ASYNC,
+						INIT_COMPACT_PRIORITY,
 						&compact_result);
 		if (page)
 			goto got_pg;
@@ -3627,7 +3628,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 * sync compaction could be very expensive, so keep
 			 * using async compaction.
 			 */
-			migration_mode = MIGRATE_ASYNC;
+			compact_priority = INIT_COMPACT_PRIORITY;
 		}
 	}
 
@@ -3693,8 +3694,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
-					migration_mode,
-					&compact_result);
+					compact_priority, &compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3734,7 +3734,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &migration_mode,
+				compact_result, &compact_priority,
 				compaction_retries))
 		goto retry;
 
-- 
2.8.4

* [PATCH v3 08/17] mm, compaction: simplify contended compaction handling
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (6 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 07/17] mm, compaction: introduce direct compaction priority Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

Async compaction detects contention either due to failing trylock on zone->lock
or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction:
khugepaged should not give up due to need_resched()") the code got quite
complicated to distinguish these two up to the __alloc_pages_slowpath() level,
so different decisions could be taken for khugepaged allocations.

After the recent changes, khugepaged allocations don't check for contended
compaction anymore, so we again don't need to distinguish lock and sched
contention, and simplify the current convoluted code a lot.

However, I believe it's also possible to simplify even more and completely
remove the check for contended compaction after the initial async compaction
for costly orders, which was originally aimed at THP page fault allocations.
There are several reasons why this can be done now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
  all, unless the system admin has overridden the default, or the application
  has indicated via madvise that it can benefit from THPs. In both cases, it
  means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
  wouldn't, the second compaction attempt is still async and will detect the
  contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
  in place that prevent excessive THP page fault latencies

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 13 ++-------
 mm/compaction.c            | 72 +++++++++-------------------------------------
 mm/internal.h              |  5 +---
 mm/page_alloc.c            | 28 +-----------------
 4 files changed, 17 insertions(+), 101 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index b470765ed9e6..095aaa220952 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -55,14 +55,6 @@ enum compact_result {
 	COMPACT_PARTIAL,
 };
 
-/* Used to signal whether compaction detected need_sched() or lock contention */
-/* No contention detected */
-#define COMPACT_CONTENDED_NONE	0
-/* Either need_sched() was true or fatal signal pending */
-#define COMPACT_CONTENDED_SCHED	1
-/* Zone lock or lru_lock was contended in async compaction */
-#define COMPACT_CONTENDED_LOCK	2
-
 struct alloc_context; /* in mm/internal.h */
 
 #ifdef CONFIG_COMPACTION
@@ -76,9 +68,8 @@ extern int sysctl_compact_unevictable_allowed;
 
 extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
-			unsigned int order,
-		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, int *contended);
+		unsigned int order, unsigned int alloc_flags,
+		const struct alloc_context *ac, enum compact_priority prio);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index 4ed4f3232d8b..f825a58bc37c 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -331,7 +331,7 @@ static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
 {
 	if (cc->mode == MIGRATE_ASYNC) {
 		if (!spin_trylock_irqsave(lock, *flags)) {
-			cc->contended = COMPACT_CONTENDED_LOCK;
+			cc->contended = true;
 			return false;
 		}
 	} else {
@@ -365,13 +365,13 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
 	}
 
 	if (fatal_signal_pending(current)) {
-		cc->contended = COMPACT_CONTENDED_SCHED;
+		cc->contended = true;
 		return true;
 	}
 
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = COMPACT_CONTENDED_SCHED;
+			cc->contended = true;
 			return true;
 		}
 		cond_resched();
@@ -394,7 +394,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
 	/* async compaction aborts if contended */
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = COMPACT_CONTENDED_SCHED;
+			cc->contended = true;
 			return true;
 		}
 
@@ -1623,14 +1623,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
 				cc->free_pfn, end_pfn, sync, ret);
 
-	if (ret == COMPACT_CONTENDED)
-		ret = COMPACT_PARTIAL;
-
 	return ret;
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum compact_priority prio, int *contended,
+		gfp_t gfp_mask, enum compact_priority prio,
 		unsigned int alloc_flags, int classzone_idx)
 {
 	enum compact_result ret;
@@ -1654,7 +1651,6 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*contended = cc.contended;
 	return ret;
 }
 
@@ -1667,23 +1663,18 @@ int sysctl_extfrag_threshold = 500;
  * @alloc_flags: The allocation flags of the current allocation
  * @ac: The context of current allocation
  * @mode: The migration mode for async, sync light, or sync migration
- * @contended: Return value that determines if compaction was aborted due to
- *	       need_resched() or lock contention
  *
  * This is the main entry point for direct page compaction.
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum compact_priority prio, int *contended)
+		enum compact_priority prio)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
 	enum compact_result rc = COMPACT_SKIPPED;
-	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
-
-	*contended = COMPACT_CONTENDED_NONE;
 
 	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
@@ -1695,7 +1686,6 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		enum compact_result status;
-		int zone_contended;
 
 		if (compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
@@ -1703,14 +1693,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		}
 
 		status = compact_zone_order(zone, order, gfp_mask, prio,
-				&zone_contended, alloc_flags,
-				ac_classzone_idx(ac));
+					alloc_flags, ac_classzone_idx(ac));
 		rc = max(status, rc);
-		/*
-		 * It takes at least one zone that wasn't lock contended
-		 * to clear all_zones_contended.
-		 */
-		all_zones_contended &= zone_contended;
 
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
@@ -1722,59 +1706,29 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			 * succeeds in this zone.
 			 */
 			compaction_defer_reset(zone, order, false);
-			/*
-			 * It is possible that async compaction aborted due to
-			 * need_resched() and the watermarks were ok thanks to
-			 * somebody else freeing memory. The allocation can
-			 * however still fail so we better signal the
-			 * need_resched() contention anyway (this will not
-			 * prevent the allocation attempt).
-			 */
-			if (zone_contended == COMPACT_CONTENDED_SCHED)
-				*contended = COMPACT_CONTENDED_SCHED;
 
-			goto break_loop;
+			break;
 		}
 
 		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
-					status == COMPACT_PARTIAL_SKIPPED)) {
+					status == COMPACT_PARTIAL_SKIPPED))
 			/*
 			 * We think that allocation won't succeed in this zone
 			 * so we defer compaction there. If it ends up
 			 * succeeding after all, it will be reset.
 			 */
 			defer_compaction(zone, order);
-		}
 
 		/*
 		 * We might have stopped compacting due to need_resched() in
 		 * async compaction, or due to a fatal signal detected. In that
-		 * case do not try further zones and signal need_resched()
-		 * contention.
-		 */
-		if ((zone_contended == COMPACT_CONTENDED_SCHED)
-					|| fatal_signal_pending(current)) {
-			*contended = COMPACT_CONTENDED_SCHED;
-			goto break_loop;
-		}
-
-		continue;
-break_loop:
-		/*
-		 * We might not have tried all the zones, so  be conservative
-		 * and assume they are not all lock contended.
+		 * case do not try further zones
 		 */
-		all_zones_contended = 0;
-		break;
+		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
+					|| fatal_signal_pending(current))
+			break;
 	}
 
-	/*
-	 * If at least one zone wasn't deferred or skipped, we report if all
-	 * zones that were tried were lock contended.
-	 */
-	if (rc > COMPACT_INACTIVE && all_zones_contended)
-		*contended = COMPACT_CONTENDED_LOCK;
-
 	return rc;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index 9b6a6c43ac39..680e5ce2ab37 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -185,10 +185,7 @@ struct compact_control {
 	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
 	const int classzone_idx;	/* zone index of a direct compactor */
 	struct zone *zone;
-	int contended;			/* Signal need_sched() or lock
-					 * contention detected during
-					 * compaction
-					 */
+	bool contended;			/* Signal lock or sched contention */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fc0f2a3d4e5c..204cc988fd64 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3173,14 +3173,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
-	int contended_compaction;
 
 	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						prio, &contended_compaction);
+									prio);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*compact_result <= COMPACT_INACTIVE)
@@ -3209,24 +3208,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTFAIL);
 
-	/*
-	 * In all zones where compaction was attempted (and not
-	 * deferred or skipped), lock contention has been detected.
-	 * For THP allocation we do not want to disrupt the others
-	 * so we fallback to base pages instead.
-	 */
-	if (contended_compaction == COMPACT_CONTENDED_LOCK)
-		*compact_result = COMPACT_CONTENDED;
-
-	/*
-	 * If compaction was aborted due to need_resched(), we do not
-	 * want to further increase allocation latency, unless it is
-	 * khugepaged trying to collapse.
-	 */
-	if (contended_compaction == COMPACT_CONTENDED_SCHED
-		&& !(current->flags & PF_KTHREAD))
-		*compact_result = COMPACT_CONTENDED;
-
 	cond_resched();
 
 	return NULL;
@@ -3617,13 +3598,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				goto nopage;
 
 			/*
-			 * Compaction is contended so rather back off than cause
-			 * excessive stalls.
-			 */
-			if (compact_result == COMPACT_CONTENDED)
-				goto nopage;
-
-			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
 			 * using async compaction.
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (7 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 08/17] mm, compaction: simplify contended compaction handling Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-07-06  5:09   ` Joonsoo Kim
  2016-06-24  9:54 ` [PATCH v3 10/17] mm, compaction: cleanup unused functions Vlastimil Babka
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

A recent patch has added a whole_zone flag that compaction sets when scanning
starts from the zone boundary, in order to report that the zone has been fully
scanned in one attempt. For allocations that want to try really hard or cannot
fail, we will want to introduce a mode where scanning the whole zone is
guaranteed regardless of the cached positions.

This patch reuses the whole_zone flag so that when it is passed to compaction
as true, the cached scanner positions are ignored. Employing this flag during
the reclaim/compaction loop will be done in the next patch. This patch however
converts compaction invoked from userspace via procfs to use this flag.
Before this patch, the cached positions were first reset to zone boundaries and
then read back from struct zone, so there was a window where a parallel
compaction could replace the reset values, making the manual compaction less
effective. Using the flag instead of performing the reset is more robust.
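
For illustration only (not part of the patch), a self-contained sketch of the
whole_zone gating; the structs are pared-down stand-ins for the kernel's zone
and compact_control, and the pageblock rounding of the free scanner is
simplified away:

#include <stdbool.h>
#include <stdio.h>

struct zone_sketch {
        unsigned long start_pfn, end_pfn;
        unsigned long cached_migrate_pfn, cached_free_pfn;
};

struct cc_sketch {
        bool whole_zone;
        unsigned long migrate_pfn, free_pfn;
};

static void scanner_start(struct zone_sketch *zone, struct cc_sketch *cc)
{
        cc->migrate_pfn = zone->cached_migrate_pfn;
        cc->free_pfn = zone->cached_free_pfn;

        /* whole_zone forces both scanners back to the zone boundaries */
        if (cc->whole_zone || cc->free_pfn < zone->start_pfn ||
            cc->free_pfn >= zone->end_pfn)
                cc->free_pfn = zone->end_pfn - 1;
        if (cc->whole_zone || cc->migrate_pfn < zone->start_pfn ||
            cc->migrate_pfn >= zone->end_pfn)
                cc->migrate_pfn = zone->start_pfn;
}

int main(void)
{
        struct zone_sketch zone = { 0, 1024, 300, 700 };
        struct cc_sketch cc = { .whole_zone = true };

        scanner_start(&zone, &cc);
        printf("migrate scanner starts at pfn %lu, free scanner at pfn %lu\n",
               cc.migrate_pfn, cc.free_pfn);
        return 0;
}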

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/compaction.c | 15 +++++----------
 mm/internal.h   |  2 +-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f825a58bc37c..e7fe848e318e 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1501,11 +1501,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	 */
 	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
 	cc->free_pfn = zone->compact_cached_free_pfn;
-	if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
+	if (cc->whole_zone || cc->free_pfn < start_pfn ||
+						cc->free_pfn >= end_pfn) {
 		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
 		zone->compact_cached_free_pfn = cc->free_pfn;
 	}
-	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
+	if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
+						cc->migrate_pfn >= end_pfn) {
 		cc->migrate_pfn = start_pfn;
 		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
 		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
@@ -1751,14 +1753,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		INIT_LIST_HEAD(&cc->freepages);
 		INIT_LIST_HEAD(&cc->migratepages);
 
-		/*
-		 * When called via /proc/sys/vm/compact_memory
-		 * this makes sure we compact the whole zone regardless of
-		 * cached scanner positions.
-		 */
-		if (is_via_compact_memory(cc->order))
-			__reset_isolation_suitable(zone);
-
 		if (is_via_compact_memory(cc->order) ||
 				!compaction_deferred(zone, cc->order))
 			compact_zone(zone, cc);
@@ -1794,6 +1788,7 @@ static void compact_node(int nid)
 		.order = -1,
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
+		.whole_zone = true,
 	};
 
 	__compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 680e5ce2ab37..153bb52335b4 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -179,7 +179,7 @@ struct compact_control {
 	enum migrate_mode mode;		/* Async or sync migration mode */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
 	bool direct_compaction;		/* False from kcompactd or /proc/... */
-	bool whole_zone;		/* Whole zone has been scanned */
+	bool whole_zone;		/* Whole zone should/has been scanned */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 10/17] mm, compaction: cleanup unused functions
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (8 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24 11:53   ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 11/17] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
anymore, so we remove it. The only caller of __compact_pgdat() is
compact_node(), so we merge them and remove code that was only reachable from
kswapd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h |  5 ----
 mm/compaction.c            | 60 +++++++++++++---------------------------------
 2 files changed, 17 insertions(+), 48 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 095aaa220952..0cc702ec80a2 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -70,7 +70,6 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
 		const struct alloc_context *ac, enum compact_priority prio);
-extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int classzone_idx);
@@ -167,10 +166,6 @@ static inline void __ClearPageMovable(struct page *page)
 {
 }
 
-static inline void compact_pgdat(pg_data_t *pgdat, int order)
-{
-}
-
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index e7fe848e318e..5c15db0001a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1736,10 +1736,18 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 
 /* Compact all zones within a node */
-static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
+static void compact_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	int zoneid;
 	struct zone *zone;
+	struct compact_control cc = {
+		.order = -1,
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+		.whole_zone = true,
+	};
+
 
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
@@ -1747,53 +1755,19 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		if (!populated_zone(zone))
 			continue;
 
-		cc->nr_freepages = 0;
-		cc->nr_migratepages = 0;
-		cc->zone = zone;
-		INIT_LIST_HEAD(&cc->freepages);
-		INIT_LIST_HEAD(&cc->migratepages);
-
-		if (is_via_compact_memory(cc->order) ||
-				!compaction_deferred(zone, cc->order))
-			compact_zone(zone, cc);
-
-		VM_BUG_ON(!list_empty(&cc->freepages));
-		VM_BUG_ON(!list_empty(&cc->migratepages));
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
 
-		if (is_via_compact_memory(cc->order))
-			continue;
+		compact_zone(zone, &cc);
 
-		if (zone_watermark_ok(zone, cc->order,
-				low_wmark_pages(zone), 0, 0))
-			compaction_defer_reset(zone, cc->order, false);
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
 	}
 }
 
-void compact_pgdat(pg_data_t *pgdat, int order)
-{
-	struct compact_control cc = {
-		.order = order,
-		.mode = MIGRATE_ASYNC,
-	};
-
-	if (!order)
-		return;
-
-	__compact_pgdat(pgdat, &cc);
-}
-
-static void compact_node(int nid)
-{
-	struct compact_control cc = {
-		.order = -1,
-		.mode = MIGRATE_SYNC,
-		.ignore_skip_hint = true,
-		.whole_zone = true,
-	};
-
-	__compact_pgdat(NODE_DATA(nid), &cc);
-}
-
 /* Compact all nodes in the system */
 static void compact_nodes(void)
 {
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 11/17] mm, compaction: add the ultimate direct compaction priority
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (9 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 10/17] mm, compaction: cleanup unused functions Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 12/17] mm, compaction: more reliably increase " Vlastimil Babka
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

During the reclaim/compaction loop, it's desirable to get a final answer from
an unsuccessful compaction so we can either fail the allocation or invoke the
OOM killer. However, heuristics such as deferred compaction or the pageblock
skip bits can cause compaction to skip parts of zones or whole zones, and lead
to premature OOMs, failures or excessive reclaim/compaction retries.

To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

- ignore deferred compaction status for a zone
- ignore pageblock skip hints
- ignore cached scanner positions and scan the whole zone

The new priority should eventually get picked up by should_compact_retry() and
this should improve success rates for costly allocations using __GFP_REPEAT,
such as hugetlbfs allocations, and reduce some corner-case OOMs for non-costly
allocations.
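
As a rough illustration of what the new priority level toggles (a stand-alone
sketch; the struct and the respect_deferral name are invented here, but the
conditions mirror the compact_zone_order() and try_to_compact_pages() hunks
below):

#include <stdbool.h>
#include <stdio.h>

enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,         /* new highest priority */
        COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,
};

struct cc_knobs {
        bool whole_zone;
        bool ignore_skip_hint;
        bool respect_deferral;
};

static struct cc_knobs knobs_for(enum compact_priority prio)
{
        struct cc_knobs k = {
                .whole_zone = (prio == COMPACT_PRIO_SYNC_FULL),
                .ignore_skip_hint = (prio == COMPACT_PRIO_SYNC_FULL),
                /* compaction_deferred() is only consulted below SYNC_FULL */
                .respect_deferral = (prio > COMPACT_PRIO_SYNC_FULL),
        };
        return k;
}

int main(void)
{
        struct cc_knobs k = knobs_for(COMPACT_PRIO_SYNC_FULL);

        printf("SYNC_FULL: whole_zone=%d ignore_skip_hint=%d respect_deferral=%d\n",
               k.whole_zone, k.ignore_skip_hint, k.respect_deferral);
        return 0;
}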

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 3 ++-
 mm/compaction.c            | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 0cc702ec80a2..869b594cf4ff 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -6,8 +6,9 @@
  * Lower value means higher priority, analogically to reclaim priority.
  */
 enum compact_priority {
+	COMPACT_PRIO_SYNC_FULL,
+	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
 	COMPACT_PRIO_SYNC_LIGHT,
-	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
 	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
 	COMPACT_PRIO_ASYNC,
 	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
diff --git a/mm/compaction.c b/mm/compaction.c
index 5c15db0001a5..76897850c3c2 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1644,6 +1644,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
+		.whole_zone = (prio == COMPACT_PRIO_SYNC_FULL),
+		.ignore_skip_hint = (prio == COMPACT_PRIO_SYNC_FULL)
 	};
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -1689,7 +1691,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 								ac->nodemask) {
 		enum compact_result status;
 
-		if (compaction_deferred(zone, order)) {
+		if (prio > COMPACT_PRIO_SYNC_FULL
+					&& compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (10 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 11/17] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-07-06  5:39   ` Joonsoo Kim
  2016-06-24  9:54 ` [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success Vlastimil Babka
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

During the reclaim/compaction loop, compaction priority can be increased by
the should_compact_retry() function, but the current code is not optimal. The
priority is only increased when compaction_failed() is true, which means that
compaction has scanned the whole zone. This may not happen even after multiple
attempts at the lower priority due to parallel activity, so we might needlessly
struggle at the lower priority and possibly run out of compaction retry
attempts in the process.

We can remove these corner cases by increasing compaction priority regardless
of compaction_failed(). Closer examination of the compaction result can be
postponed until the highest priority has been reached. This is a simple solution
and we don't need to worry about reaching the highest priority "too soon" here,
because when should_compact_retry() is called it means that the system is
already struggling and the allocation is supposed to either try as hard as
possible, or it cannot fail at all. There's not much point staying at lower
priorities with heuristics that may result in only partial compaction.
Also, we now count compaction retries only after reaching the highest priority.

The only exception here is the COMPACT_SKIPPED result, which means that
compaction could not run at all due to being below order-0 watermarks. In that
case, don't increase compaction priority, and check if compaction could proceed
when everything reclaimable was reclaimed. Before this patch, this was tied to
compaction_withdrawn(), but the other results considered there are in fact only
possible due to low compaction priority so we can ignore them thanks to the
patch. Since there are no other callers of compaction_withdrawn(), change its
semantics to remove the low priority scenarios.
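
The resulting decision flow can be sketched in user space roughly as follows
(MAX_COMPACT_RETRIES mirrors the kernel value; the boolean parameters stand in
for the compaction_withdrawn()/compaction_failed()/compaction_made_progress()
checks, and priority 0 stands in for MIN_COMPACT_PRIORITY):

#include <stdbool.h>
#include <stdio.h>

#define MAX_COMPACT_RETRIES     16

static bool should_retry_sketch(bool withdrawn, bool failed, bool made_progress,
                                int *prio, int *retries)
{
        if (withdrawn)          /* COMPACT_SKIPPED: retry only if reclaim can free enough */
                return true;    /* (stands in for compaction_zonelist_suitable()) */

        if (*prio > 0) {        /* not yet at the highest priority: escalate first */
                (*prio)--;
                return true;
        }

        if (failed)             /* whole zones scanned at the highest priority: give up */
                return false;

        if (made_progress)      /* retries are only counted at the highest priority */
                (*retries)++;

        return *retries <= MAX_COMPACT_RETRIES;
}

int main(void)
{
        int prio = 2, retries = 0, attempt;     /* 2 ~ COMPACT_PRIO_ASYNC */

        for (attempt = 1; attempt <= 5; attempt++) {
                if (!should_retry_sketch(false, false, true, &prio, &retries))
                        break;
                printf("attempt %d: retry at priority %d (retries counted: %d)\n",
                       attempt, prio, retries);
        }
        return 0;
}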

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 28 ++-----------------------
 mm/page_alloc.c            | 51 ++++++++++++++++++++++++++--------------------
 2 files changed, 31 insertions(+), 48 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 869b594cf4ff..a6b3d5d2ae53 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -106,8 +106,8 @@ static inline bool compaction_failed(enum compact_result result)
 }
 
 /*
- * Compaction  has backed off for some reason. It might be throttling or
- * lock contention. Retrying is still worthwhile.
+ * Compaction has backed off because it cannot proceed until there is enough
+ * free memory. Retrying is still worthwhile after reclaim.
  */
 static inline bool compaction_withdrawn(enum compact_result result)
 {
@@ -118,30 +118,6 @@ static inline bool compaction_withdrawn(enum compact_result result)
 	if (result == COMPACT_SKIPPED)
 		return true;
 
-	/*
-	 * If compaction is deferred for high-order allocations, it is
-	 * because sync compaction recently failed. If this is the case
-	 * and the caller requested a THP allocation, we do not want
-	 * to heavily disrupt the system, so we fail the allocation
-	 * instead of entering direct reclaim.
-	 */
-	if (result == COMPACT_DEFERRED)
-		return true;
-
-	/*
-	 * If compaction in async mode encounters contention or blocks higher
-	 * priority task we back off early rather than cause stalls.
-	 */
-	if (result == COMPACT_CONTENDED)
-		return true;
-
-	/*
-	 * Page scanners have met but we haven't scanned full zones so this
-	 * is a back off in fact.
-	 */
-	if (result == COMPACT_PARTIAL_SKIPPED)
-		return true;
-
 	return false;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 204cc988fd64..e1efdc8d2a52 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3217,7 +3217,7 @@ static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
-		     int compaction_retries)
+		     int *compaction_retries)
 {
 	int max_retries = MAX_COMPACT_RETRIES;
 
@@ -3225,28 +3225,35 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 		return false;
 
 	/*
-	 * compaction considers all the zone as desperately out of memory
-	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by insufficient priority
+	 * Compaction backed off due to watermark checks for order-0
+	 * so the regular reclaim has to try harder and reclaim something
+	 * Retry only if it looks like reclaim might have a chance.
 	 */
-	if (compaction_failed(compact_result)) {
-		if (*compact_priority > MIN_COMPACT_PRIORITY) {
-			(*compact_priority)--;
-			return true;
-		}
-		return false;
+	if (compaction_withdrawn(compact_result))
+		return compaction_zonelist_suitable(ac, order, alloc_flags);
+
+	/*
+	 * Compaction could have withdrawn early or skip some zones or
+	 * pageblocks. We were asked to retry, which means the allocation
+	 * should try really hard, so increase the priority if possible.
+	 */
+	if (*compact_priority > MIN_COMPACT_PRIORITY) {
+		(*compact_priority)--;
+		return true;
 	}
 
 	/*
-	 * make sure the compaction wasn't deferred or didn't bail out early
-	 * due to locks contention before we declare that we should give up.
-	 * But do not retry if the given zonelist is not suitable for
-	 * compaction.
+	 * Compaction considers all the zones as unfixably fragmented and we
+	 * are on the highest priority, which means it can't be due to
+	 * heuristics and it doesn't really make much sense to retry.
 	 */
-	if (compaction_withdrawn(compact_result))
-		return compaction_zonelist_suitable(ac, order, alloc_flags);
+	if (compaction_failed(compact_result))
+		return false;
 
 	/*
+	 * The remaining possibility is that compaction made progress and
+	 * created a high-order page, but it was allocated by somebody else.
+	 * To prevent thrashing, limit the number of retries in such case.
 	 * !costly requests are much more important than __GFP_REPEAT
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
@@ -3254,9 +3261,12 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	 * would need much more detailed feedback from compaction to
 	 * make a better decision.
 	 */
+	if (compaction_made_progress(compact_result))
+		(*compaction_retries)++;
+
 	if (order > PAGE_ALLOC_COSTLY_ORDER)
 		max_retries /= 4;
-	if (compaction_retries <= max_retries)
+	if (*compaction_retries <= max_retries)
 		return true;
 
 	return false;
@@ -3275,7 +3285,7 @@ static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
 		     enum compact_priority *compact_priority,
-		     int compaction_retries)
+		     int *compaction_retries)
 {
 	struct zone *zone;
 	struct zoneref *z;
@@ -3672,9 +3682,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
-	if (order && compaction_made_progress(compact_result))
-		compaction_retries++;
-
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
 		goto nopage;
@@ -3709,7 +3716,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
-				compaction_retries))
+				&compaction_retries))
 		goto retry;
 
 	/* Reclaim has failed us, start killing things */
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (11 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 12/17] mm, compaction: more reliably increase " Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-07-06  5:47   ` Joonsoo Kim
  2016-06-24  9:54 ` [PATCH v3 14/17] mm, compaction: create compact_gap wrapper Vlastimil Babka
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

The __compact_finished() function uses the low watermark in a check that has to
pass if direct compaction is to finish and the allocation should succeed. This
is too pessimistic, as the allocation will typically use the min watermark. It
may happen that during compaction we drop below the low watermark (due to
parallel activity), but still form the target high-order page. By checking
against the low watermark, we might needlessly continue compaction.

Similarly, __compaction_suitable() uses the low watermark in a check of whether
the allocation can succeed without compaction. Again, this is unnecessarily
pessimistic.

After this patch, these checks will use the direct compactor's alloc_flags to
determine the watermark, which is effectively the min watermark.
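
A minimal user-space sketch of the selection; the ALLOC_WMARK_* encoding is
assumed to follow mm/internal.h, where the low bits of alloc_flags index
directly into the zone's watermark array, and the watermark values are made up:

#include <stdio.h>

enum zone_watermarks { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

#define ALLOC_WMARK_MIN         WMARK_MIN
#define ALLOC_WMARK_LOW         WMARK_LOW
#define ALLOC_WMARK_MASK        0x3

struct zone_sketch {
        unsigned long watermark[NR_WMARK];
};

int main(void)
{
        struct zone_sketch zone = { .watermark = { 1000, 1250, 1500 } };
        unsigned int alloc_flags = ALLOC_WMARK_MIN;     /* typical direct compactor */

        /* __compact_finished() now checks against this instead of low_wmark_pages() */
        printf("watermark used: %lu pages\n",
               zone.watermark[alloc_flags & ALLOC_WMARK_MASK]);
        return 0;
}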

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/compaction.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 76897850c3c2..371760a85085 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1320,7 +1320,7 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
 		return COMPACT_CONTINUE;
 
 	/* Compaction run is not finished if the watermark is not met */
-	watermark = low_wmark_pages(zone);
+	watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];
 
 	if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
 							cc->alloc_flags))
@@ -1385,7 +1385,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;
 
-	watermark = low_wmark_pages(zone);
+	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 	/*
 	 * If watermarks for high-order allocation are already met, there
 	 * should be no need for compaction at all.
@@ -1399,7 +1399,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	 * This is because during migration, copies of pages need to be
 	 * allocated and for a short time, the footprint is higher
 	 */
-	watermark += (2UL << order);
+	watermark = low_wmark_pages(zone) + (2UL << order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
 				 alloc_flags, wmark_target))
 		return COMPACT_SKIPPED;
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 14/17] mm, compaction: create compact_gap wrapper
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (12 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 15/17] mm, compaction: use proper alloc_flags in __compaction_suitable() Vlastimil Babka
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

Compaction uses a watermark gap of (2UL << order) pages at various places and
it's not immediately obvious why. Abstract it through a compact_gap() wrapper
to create a single place with a thorough explanation.
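
A stand-alone rendition of the gap arithmetic, matching the wrapper added below
(the 4 KB base page size in the output is an assumption of the example):

#include <stdio.h>

/* twice the requested allocation size, expressed in order-0 pages */
static unsigned long compact_gap(unsigned int order)
{
        return 2UL << order;
}

int main(void)
{
        unsigned int order;

        /* e.g. with 4 KB base pages, order-9 (a 2 MB THP) implies a 4 MB gap */
        for (order = 0; order <= 9; order++)
                printf("order %u: gap = %lu pages (%lu KB)\n",
                       order, compact_gap(order), compact_gap(order) * 4);
        return 0;
}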

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h | 16 ++++++++++++++++
 mm/compaction.c            |  7 +++----
 mm/vmscan.c                |  4 ++--
 3 files changed, 21 insertions(+), 6 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index a6b3d5d2ae53..67a3372c4753 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -58,6 +58,22 @@ enum compact_result {
 
 struct alloc_context; /* in mm/internal.h */
 
+/*
+ * Number of free order-0 pages that should be available above given watermark
+ * to make sure compaction has reasonable chance of not running out of free
+ * pages that it needs to isolate as migration target during its work.
+ */
+static inline unsigned long compact_gap(unsigned int order)
+{
+	/*
+	 * Although all the isolations for migration are temporary, compaction
+	 * may have up to 1 << order pages on its list and then try to split
+	 * an (order - 1) free page. At that point, a gap of 1 << order might
+	 * not be enough, so it's safer to require twice that amount.
+	 */
+	return 2UL << order;
+}
+
 #ifdef CONFIG_COMPACTION
 extern int sysctl_compact_memory;
 extern int sysctl_compaction_handler(struct ctl_table *table, int write,
diff --git a/mm/compaction.c b/mm/compaction.c
index 371760a85085..c1ce7c2abe05 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1395,11 +1395,10 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 		return COMPACT_PARTIAL;
 
 	/*
-	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
-	 * This is because during migration, copies of pages need to be
-	 * allocated and for a short time, the footprint is higher
+	 * Watermarks for order-0 must be met for compaction to be able to
+	 * isolate free pages for migration targets.
 	 */
-	watermark = low_wmark_pages(zone) + (2UL << order);
+	watermark = low_wmark_pages(zone) + compact_gap(order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
 				 alloc_flags, wmark_target))
 		return COMPACT_SKIPPED;
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 21d417ccff69..484ff05d5a8f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2351,7 +2351,7 @@ static inline bool should_continue_reclaim(struct zone *zone,
 	 * If we have not reclaimed enough pages for compaction and the
 	 * inactive lists are large enough, continue reclaiming
 	 */
-	pages_for_compaction = (2UL << sc->order);
+	pages_for_compaction = compact_gap(sc->order);
 	inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
 	if (get_nr_swap_pages() > 0)
 		inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);
@@ -2478,7 +2478,7 @@ static inline bool compaction_ready(struct zone *zone, int order, int classzone_
 	 */
 	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
 			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
-	watermark = high_wmark_pages(zone) + balance_gap + (2UL << order);
+	watermark = high_wmark_pages(zone) + balance_gap + compact_gap(order);
 	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 
 	/*
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 15/17] mm, compaction: use proper alloc_flags in __compaction_suitable()
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (13 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 14/17] mm, compaction: create compact_gap wrapper Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 16/17] mm, compaction: require only min watermarks for non-costly orders Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable Vlastimil Babka
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. This check uses the direct compactor's alloc_flags, but that's
wrong, since these flags are not applicable for freepage isolation.

For example, alloc_flags may indicate access to memory reserves, making
compaction proceed and then fail the watermark check during isolation.

A similar problem exists for ALLOC_CMA, which may be part of alloc_flags, but
not during freepage isolation. In this case, however, it makes sense to use
ALLOC_CMA both in __compaction_suitable() and __isolate_free_page(), since
there's actually nothing preventing the freepage scanner from isolating pages
in CMA pageblocks, with the assumption that a page that could be migrated once
by compaction can also be migrated later by a CMA allocation. Thus we should
count pages in CMA pageblocks when considering compaction suitability and when
isolating freepages.

To sum up, this patch should remove some false positives from
__compaction_suitable(), and allow compaction to proceed when free pages
required for compaction reside in the CMA pageblocks.
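
A simplified user-space sketch of why passing ALLOC_CMA matters for the check:
in the kernel's __zone_watermark_ok(), free CMA pages only count towards the
watermark when ALLOC_CMA is set (the flag value and page counts below are made
up for illustration):

#include <stdbool.h>
#include <stdio.h>

#define ALLOC_CMA       0x80    /* illustrative value */

static bool watermark_ok(unsigned long free_pages, unsigned long free_cma,
                         unsigned long watermark, unsigned int alloc_flags)
{
        if (!(alloc_flags & ALLOC_CMA))
                free_pages -= free_cma; /* CMA pages can't satisfy this request */
        return free_pages > watermark;
}

int main(void)
{
        /* most free memory sits in CMA pageblocks: compaction should still run */
        printf("without ALLOC_CMA: %d\n", watermark_ok(2000, 1500, 1000, 0));
        printf("with    ALLOC_CMA: %d\n", watermark_ok(2000, 1500, 1000, ALLOC_CMA));
        return 0;
}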

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 12 ++++++++++--
 mm/page_alloc.c |  2 +-
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c1ce7c2abe05..3b774befb62a 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1396,11 +1396,19 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 
 	/*
 	 * Watermarks for order-0 must be met for compaction to be able to
-	 * isolate free pages for migration targets.
+	 * isolate free pages for migration targets. This means that the
+	 * watermark and alloc_flags have to match, or be more pessimistic than
+	 * the check in __isolate_free_page(). We don't use the direct
+	 * compactor's alloc_flags, as they are not relevant for freepage
+	 * isolation. We however do use the direct compactor's classzone_idx to
+	 * skip over zones where lowmem reserves would prevent allocation even
+	 * if compaction succeeds.
+	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
+	 * suitable migration targets
 	 */
 	watermark = low_wmark_pages(zone) + compact_gap(order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
-				 alloc_flags, wmark_target))
+						ALLOC_CMA, wmark_target))
 		return COMPACT_SKIPPED;
 
 	/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index e1efdc8d2a52..9510b91517dd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2503,7 +2503,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 	if (!is_migrate_isolate(mt)) {
 		/* Obey watermarks as if the page was being allocated */
 		watermark = low_wmark_pages(zone) + (1 << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
 			return 0;
 
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 16/17] mm, compaction: require only min watermarks for non-costly orders
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (14 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 15/17] mm, compaction: use proper alloc_flags in __compaction_suitable() Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-06-24  9:54 ` [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable Vlastimil Babka
  16 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

The __compaction_suitable() function checks the low watermark plus a
compact_gap() gap to decide if there's enough free memory to perform
compaction. Then __isolate_free_page() uses a low watermark check to decide
whether a particular free page can be isolated. In the latter case, using the
low watermark is needlessly pessimistic, as the free page isolations are only
temporary. For __compaction_suitable() the higher watermark makes sense for
high-order allocations, where more free pages increase the chance of success
and we can typically fail with some order-0 fallback when the system is
struggling to reach that watermark. But for low-order allocations, forming the
page should not be that hard. So using the low watermark here might just
prevent compaction from even trying, and eventually lead to the OOM killer even
if we are above the min watermark.

So after this patch, we use the min watermark for non-costly orders in
__compaction_suitable(), and for all orders in __isolate_free_page().
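
For reference, a stand-alone sketch of the resulting watermark choice in
__compaction_suitable(); PAGE_ALLOC_COSTLY_ORDER and compact_gap() mirror the
kernel, while the example watermark values are made up:

#include <stdio.h>

#define PAGE_ALLOC_COSTLY_ORDER 3

static unsigned long compact_gap(unsigned int order)
{
        return 2UL << order;
}

static unsigned long suitable_watermark(unsigned int order,
                                        unsigned long min_wmark,
                                        unsigned long low_wmark)
{
        unsigned long wmark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
                                                low_wmark : min_wmark;
        return wmark + compact_gap(order);
}

int main(void)
{
        unsigned long min_wmark = 1000, low_wmark = 1250;

        /* order-2 (non-costly) only needs min + gap; order-9 still needs low + gap */
        printf("order 2: %lu pages\n", suitable_watermark(2, min_wmark, low_wmark));
        printf("order 9: %lu pages\n", suitable_watermark(9, min_wmark, low_wmark));
        return 0;
}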

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/compaction.c | 6 +++++-
 mm/page_alloc.c | 2 +-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 3b774befb62a..ddff4cc48067 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1403,10 +1403,14 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	 * isolation. We however do use the direct compactor's classzone_idx to
 	 * skip over zones where lowmem reserves would prevent allocation even
 	 * if compaction succeeds.
+	 * For costly orders, we require low watermark instead of min for
+	 * compaction to proceed to increase its chances.
 	 * ALLOC_CMA is used, as pages in CMA pageblocks are considered
 	 * suitable migration targets
 	 */
-	watermark = low_wmark_pages(zone) + compact_gap(order);
+	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
+				low_wmark_pages(zone) : min_wmark_pages(zone);
+	watermark += compact_gap(order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
 						ALLOC_CMA, wmark_target))
 		return COMPACT_SKIPPED;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9510b91517dd..4a963659f8bb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2502,7 +2502,7 @@ int __isolate_free_page(struct page *page, unsigned int order)
 
 	if (!is_migrate_isolate(mt)) {
 		/* Obey watermarks as if the page was being allocated */
-		watermark = low_wmark_pages(zone) + (1 << order);
+		watermark = min_wmark_pages(zone) + (1UL << order);
 		if (!zone_watermark_ok(zone, 0, watermark, 0, ALLOC_CMA))
 			return 0;
 
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable
  2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
                   ` (15 preceding siblings ...)
  2016-06-24  9:54 ` [PATCH v3 16/17] mm, compaction: require only min watermarks for non-costly orders Vlastimil Babka
@ 2016-06-24  9:54 ` Vlastimil Babka
  2016-07-06  5:55   ` Joonsoo Kim
  16 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24  9:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel, Vlastimil Babka

The compaction_ready() function is used during direct reclaim for costly-order
allocations to skip reclaim for zones where compaction should be attempted
instead. It combines the standard compaction_suitable() check with its own
watermark check based on the high watermark plus an extra gap, and the result
is confusing at best.

This patch attempts to better structure and document the checks involved.
First, compaction_suitable() can determine that the allocation should either
succeed already, or that compaction doesn't have enough free pages to proceed.
The third possibility is that compaction has enough free pages, but we still
decide to reclaim first - unless we are already above the high watermark plus
the gap. This does not mean that the reclaim will actually reach this watermark
during a single attempt; it is rather an over-reclaim protection. So document
the code as such. The check for compaction_deferred() is removed completely, as
it in fact had no proper role here.

The result after this patch is mainly less confusing code. We also skip some
over-reclaim in cases where the allocation should already succeed.
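
The restructured decision can be summarized by this user-space sketch; the enum
values are local to the sketch, not the kernel's compact_result constants:

#include <stdbool.h>
#include <stdio.h>

enum suitable_sketch { SKETCH_PARTIAL, SKETCH_SKIPPED, SKETCH_CONTINUE };

static bool compaction_ready_sketch(enum suitable_sketch suitable,
                                    bool above_high_plus_gap)
{
        if (suitable == SKETCH_PARTIAL)
                return true;    /* allocation should already succeed: don't reclaim */
        if (suitable == SKETCH_SKIPPED)
                return false;   /* compaction lacks free pages: reclaim first */
        /* compaction could run: reclaim only until high watermark + compact_gap */
        return above_high_plus_gap;
}

int main(void)
{
        printf("suitable, below high+gap -> reclaim first: %d\n",
               !compaction_ready_sketch(SKETCH_CONTINUE, false));
        return 0;
}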

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/vmscan.c | 43 ++++++++++++++++++++-----------------------
 1 file changed, 20 insertions(+), 23 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 484ff05d5a8f..724131661f0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2462,40 +2462,37 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
 }
 
 /*
- * Returns true if compaction should go ahead for a high-order request, or
- * the high-order allocation would succeed without compaction.
+ * Returns true if compaction should go ahead for a costly-order request, or
+ * the allocation would already succeed without compaction. Return false if we
+ * should reclaim first.
  */
 static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
 {
 	unsigned long balance_gap, watermark;
-	bool watermark_ok;
+	enum compact_result suitable;
+
+	suitable = compaction_suitable(zone, order, 0, classzone_idx);
+	if (suitable == COMPACT_PARTIAL)
+		/* Allocation should succeed already. Don't reclaim. */
+		return true;
+	if (suitable == COMPACT_SKIPPED)
+		/* Compaction cannot yet proceed. Do reclaim. */
+		return false;
 
 	/*
-	 * Compaction takes time to run and there are potentially other
-	 * callers using the pages just freed. Continue reclaiming until
-	 * there is a buffer of free pages available to give compaction
-	 * a reasonable chance of completing and allocating the page
+	 * Compaction is already possible, but it takes time to run and there
+	 * are potentially other callers using the pages just freed. So proceed
+	 * with reclaim to make a buffer of free pages available to give
+	 * compaction a reasonable chance of completing and allocating the page.
+	 * Note that we won't actually reclaim the whole buffer in one attempt
+	 * as the target watermark in should_continue_reclaim() is lower. But if
+	 * we are already above the high+gap watermark, don't reclaim at all.
 	 */
 	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
 			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
 	watermark = high_wmark_pages(zone) + balance_gap + compact_gap(order);
-	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
-
-	/*
-	 * If compaction is deferred, reclaim up to a point where
-	 * compaction will have a chance of success when re-enabled
-	 */
-	if (compaction_deferred(zone, order))
-		return watermark_ok;
-
-	/*
-	 * If compaction is not ready to start and allocation is not likely
-	 * to succeed without it, then keep reclaiming.
-	 */
-	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
-		return false;
 
-	return watermark_ok;
+	return zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
 }
 
 /*
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/17] mm, compaction: introduce direct compaction priority
  2016-06-24  9:54 ` [PATCH v3 07/17] mm, compaction: introduce direct compaction priority Vlastimil Babka
@ 2016-06-24 11:39   ` kbuild test robot
  2016-06-24 11:51     ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: kbuild test robot @ 2016-06-24 11:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: kbuild-all, Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Mel Gorman, Joonsoo Kim, David Rientjes, Rik van Riel,
	Vlastimil Babka

[-- Attachment #1: Type: text/plain, Size: 4808 bytes --]

Hi,

[auto build test ERROR on next-20160624]
[cannot apply to tip/perf/core v4.7-rc4 v4.7-rc3 v4.7-rc2 v4.7-rc4]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Vlastimil-Babka/make-direct-compaction-more-deterministic/20160624-180056
config: m68k-sun3_defconfig (attached as .config)
compiler: m68k-linux-gcc (GCC) 4.9.0
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=m68k 

All errors (new ones prefixed by >>):

   In file included from mm/page_alloc.c:60:0:
>> include/linux/migrate.h:79:19: error: redefinition of 'PageMovable'
    static inline int PageMovable(struct page *page) { return 0; };
                      ^
   In file included from mm/page_alloc.c:56:0:
   include/linux/compaction.h:166:19: note: previous definition of 'PageMovable' was here
    static inline int PageMovable(struct page *page)
                      ^
   In file included from mm/page_alloc.c:60:0:
>> include/linux/migrate.h:80:20: error: redefinition of '__SetPageMovable'
    static inline void __SetPageMovable(struct page *page,
                       ^
   In file included from mm/page_alloc.c:56:0:
   include/linux/compaction.h:170:20: note: previous definition of '__SetPageMovable' was here
    static inline void __SetPageMovable(struct page *page,
                       ^
   In file included from mm/page_alloc.c:60:0:
>> include/linux/migrate.h:84:20: error: redefinition of '__ClearPageMovable'
    static inline void __ClearPageMovable(struct page *page)
                       ^
   In file included from mm/page_alloc.c:56:0:
   include/linux/compaction.h:175:20: note: previous definition of '__ClearPageMovable' was here
    static inline void __ClearPageMovable(struct page *page)
                       ^
--
   In file included from mm/compaction.c:13:0:
>> include/linux/compaction.h:166:19: error: redefinition of 'PageMovable'
    static inline int PageMovable(struct page *page)
                      ^
   In file included from mm/compaction.c:12:0:
   include/linux/migrate.h:79:19: note: previous definition of 'PageMovable' was here
    static inline int PageMovable(struct page *page) { return 0; };
                      ^
   In file included from mm/compaction.c:13:0:
>> include/linux/compaction.h:170:20: error: redefinition of '__SetPageMovable'
    static inline void __SetPageMovable(struct page *page,
                       ^
   In file included from mm/compaction.c:12:0:
   include/linux/migrate.h:80:20: note: previous definition of '__SetPageMovable' was here
    static inline void __SetPageMovable(struct page *page,
                       ^
   In file included from mm/compaction.c:13:0:
>> include/linux/compaction.h:175:20: error: redefinition of '__ClearPageMovable'
    static inline void __ClearPageMovable(struct page *page)
                       ^
   In file included from mm/compaction.c:12:0:
   include/linux/migrate.h:84:20: note: previous definition of '__ClearPageMovable' was here
    static inline void __ClearPageMovable(struct page *page)
                       ^

vim +/__SetPageMovable +80 include/linux/migrate.h

7039e1db Peter Zijlstra 2012-10-25  73  
e8c9f6f5 Minchan Kim    2016-06-24  74  #ifdef CONFIG_COMPACTION
e8c9f6f5 Minchan Kim    2016-06-24  75  extern int PageMovable(struct page *page);
e8c9f6f5 Minchan Kim    2016-06-24  76  extern void __SetPageMovable(struct page *page, struct address_space *mapping);
e8c9f6f5 Minchan Kim    2016-06-24  77  extern void __ClearPageMovable(struct page *page);
e8c9f6f5 Minchan Kim    2016-06-24  78  #else
e8c9f6f5 Minchan Kim    2016-06-24 @79  static inline int PageMovable(struct page *page) { return 0; };
e8c9f6f5 Minchan Kim    2016-06-24 @80  static inline void __SetPageMovable(struct page *page,
e8c9f6f5 Minchan Kim    2016-06-24  81  				struct address_space *mapping)
e8c9f6f5 Minchan Kim    2016-06-24  82  {
e8c9f6f5 Minchan Kim    2016-06-24  83  }
e8c9f6f5 Minchan Kim    2016-06-24 @84  static inline void __ClearPageMovable(struct page *page)
e8c9f6f5 Minchan Kim    2016-06-24  85  {
e8c9f6f5 Minchan Kim    2016-06-24  86  }
e8c9f6f5 Minchan Kim    2016-06-24  87  #endif

:::::: The code at line 80 was first introduced by commit
:::::: e8c9f6f50a2424f46bc72557af356f4be8f835fe mm: fix build warnings in <linux/compaction.h>

:::::: TO: Minchan Kim <minchan@kernel.org>
:::::: CC: Stephen Rothwell <sfr@canb.auug.org.au>

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/octet-stream, Size: 11801 bytes --]

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/17] mm, compaction: introduce direct compaction priority
  2016-06-24 11:39   ` kbuild test robot
@ 2016-06-24 11:51     ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24 11:51 UTC (permalink / raw)
  To: kbuild test robot
  Cc: kbuild-all, Andrew Morton, linux-kernel, linux-mm, Michal Hocko,
	Mel Gorman, Joonsoo Kim, David Rientjes, Rik van Riel

On 06/24/2016 01:39 PM, kbuild test robot wrote:
> Hi,
> 
> [auto build test ERROR on next-20160624]
> [cannot apply to tip/perf/core v4.7-rc4 v4.7-rc3 v4.7-rc2 v4.7-rc4]
> [if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

Hmm, rebasing snafu. Here's updated patch:

----8<----
>From 10ac187717494a086b961a8c1eaea17e091180a2 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 2 May 2016 15:40:29 +0200
Subject: [PATCH v3 07/17] mm, compaction: introduce direct compaction priority

In the context of direct compaction, for some types of allocations we would
like the compaction to either succeed or definitely fail while trying as hard
as possible. The current async/sync_light migration mode is insufficient, as
there are heuristics such as caching of scanner positions, marking pageblocks
as unsuitable, or deferring compaction for a zone. At least the final
compaction attempt should be able to override these heuristics.

To communicate how hard compaction should try, we replace migration mode with
a new enum compact_priority and change the relevant function signatures. In
compact_zone_order() where struct compact_control is constructed, the priority
is mapped to suitable control flags. This patch itself has no functional
change, as the current priority levels are mapped back to the same migration
modes as before. Expanding them will be done next.

Note that the !CONFIG_COMPACTION variant of try_to_compact_pages() is removed,
as the only caller exists under CONFIG_COMPACTION.
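
A minimal stand-alone sketch of the priority-to-mode mapping introduced in
compact_zone_order() (the enums are trimmed copies of the kernel's, omitting
the MIN/DEF/INIT aliases):

#include <stdio.h>

enum compact_priority { COMPACT_PRIO_SYNC_LIGHT, COMPACT_PRIO_ASYNC };
enum migrate_mode { MIGRATE_ASYNC, MIGRATE_SYNC_LIGHT, MIGRATE_SYNC };

static enum migrate_mode prio_to_mode(enum compact_priority prio)
{
        return (prio == COMPACT_PRIO_ASYNC) ? MIGRATE_ASYNC : MIGRATE_SYNC_LIGHT;
}

int main(void)
{
        printf("ASYNC prio -> mode %d, SYNC_LIGHT prio -> mode %d\n",
               prio_to_mode(COMPACT_PRIO_ASYNC),
               prio_to_mode(COMPACT_PRIO_SYNC_LIGHT));
        return 0;
}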

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h        | 22 +++++++++++++---------
 include/trace/events/compaction.h | 12 ++++++------
 mm/compaction.c                   | 13 +++++++------
 mm/page_alloc.c                   | 28 ++++++++++++++--------------
 4 files changed, 40 insertions(+), 35 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 1a02dab16646..0980a6ce4436 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,18 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+/*
+ * Determines how hard direct compaction should try to succeed.
+ * Lower value means higher priority, analogically to reclaim priority.
+ */
+enum compact_priority {
+	COMPACT_PRIO_SYNC_LIGHT,
+	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+	COMPACT_PRIO_ASYNC,
+	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
+};
+
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
@@ -66,7 +78,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended);
+		enum compact_priority prio, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
@@ -151,14 +163,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
-			unsigned int order, int alloc_flags,
-			const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended)
-{
-	return COMPACT_CONTINUE;
-}
-
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
diff --git a/include/trace/events/compaction.h b/include/trace/events/compaction.h
index 36e2d6fb1360..c2ba402ab256 100644
--- a/include/trace/events/compaction.h
+++ b/include/trace/events/compaction.h
@@ -226,26 +226,26 @@ TRACE_EVENT(mm_compaction_try_to_compact_pages,
 	TP_PROTO(
 		int order,
 		gfp_t gfp_mask,
-		enum migrate_mode mode),
+		int prio),
 
-	TP_ARGS(order, gfp_mask, mode),
+	TP_ARGS(order, gfp_mask, prio),
 
 	TP_STRUCT__entry(
 		__field(int, order)
 		__field(gfp_t, gfp_mask)
-		__field(enum migrate_mode, mode)
+		__field(int, prio)
 	),
 
 	TP_fast_assign(
 		__entry->order = order;
 		__entry->gfp_mask = gfp_mask;
-		__entry->mode = mode;
+		__entry->prio = prio;
 	),
 
-	TP_printk("order=%d gfp_mask=0x%x mode=%d",
+	TP_printk("order=%d gfp_mask=0x%x priority=%d",
 		__entry->order,
 		__entry->gfp_mask,
-		(int)__entry->mode)
+		__entry->prio)
 );
 
 DECLARE_EVENT_CLASS(mm_compaction_suitable_template,
diff --git a/mm/compaction.c b/mm/compaction.c
index b7b696e46eaa..4ed4f3232d8b 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1630,7 +1630,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+		gfp_t gfp_mask, enum compact_priority prio, int *contended,
 		unsigned int alloc_flags, int classzone_idx)
 {
 	enum compact_result ret;
@@ -1640,7 +1640,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.order = order,
 		.gfp_mask = gfp_mask,
 		.zone = zone,
-		.mode = mode,
+		.mode = (prio == COMPACT_PRIO_ASYNC) ?
+					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
@@ -1673,7 +1674,7 @@ int sysctl_extfrag_threshold = 500;
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, int *contended)
+		enum compact_priority prio, int *contended)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
@@ -1688,7 +1689,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	if (!order || !may_enter_fs || !may_perform_io)
 		return COMPACT_SKIPPED;
 
-	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
+	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
 
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
@@ -1701,7 +1702,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			continue;
 		}
 
-		status = compact_zone_order(zone, order, gfp_mask, mode,
+		status = compact_zone_order(zone, order, gfp_mask, prio,
 				&zone_contended, alloc_flags,
 				ac_classzone_idx(ac));
 		rc = max(status, rc);
@@ -1735,7 +1736,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			goto break_loop;
 		}
 
-		if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
 					status == COMPACT_PARTIAL_SKIPPED)) {
 			/*
 			 * We think that allocation won't succeed in this zone
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8ea9d1be54e9..fc0f2a3d4e5c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3170,7 +3170,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
 	int contended_compaction;
@@ -3180,7 +3180,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						mode, &contended_compaction);
+						prio, &contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*compact_result <= COMPACT_INACTIVE)
@@ -3234,7 +3234,8 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
+		     enum compact_result compact_result,
+		     enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	int max_retries = MAX_COMPACT_RETRIES;
@@ -3245,11 +3246,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	/*
 	 * compaction considers all the zone as desperately out of memory
 	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by weak migration mode.
+	 * failure could be caused by insufficient priority
 	 */
 	if (compaction_failed(compact_result)) {
-		if (*migrate_mode == MIGRATE_ASYNC) {
-			*migrate_mode = MIGRATE_SYNC_LIGHT;
+		if (*compact_priority > MIN_COMPACT_PRIORITY) {
+			(*compact_priority)--;
 			return true;
 		}
 		return false;
@@ -3283,7 +3284,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	*compact_result = COMPACT_SKIPPED;
 	return NULL;
@@ -3292,7 +3293,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
-		     enum migrate_mode *migrate_mode,
+		     enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	struct zone *zone;
@@ -3545,7 +3546,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
+	enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
@@ -3594,7 +3595,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
 		page = __alloc_pages_direct_compact(gfp_mask, order,
 						alloc_flags, ac,
-						MIGRATE_ASYNC,
+						INIT_COMPACT_PRIORITY,
 						&compact_result);
 		if (page)
 			goto got_pg;
@@ -3627,7 +3628,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 * sync compaction could be very expensive, so keep
 			 * using async compaction.
 			 */
-			migration_mode = MIGRATE_ASYNC;
+			compact_priority = INIT_COMPACT_PRIORITY;
 		}
 	}
 
@@ -3693,8 +3694,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
-					migration_mode,
-					&compact_result);
+					compact_priority, &compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3734,7 +3734,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &migration_mode,
+				compact_result, &compact_priority,
 				compaction_retries))
 		goto retry;
 
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 10/17] mm, compaction: cleanup unused functions
  2016-06-24  9:54 ` [PATCH v3 10/17] mm, compaction: cleanup unused functions Vlastimil Babka
@ 2016-06-24 11:53   ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-06-24 11:53 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, Michal Hocko, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel

On 06/24/2016 11:54 AM, Vlastimil Babka wrote:
> Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
> anymore, so we remove it. The only caller of __compact_pgdat() is
> compact_node(), so we merge them and remove code that was only reachable from
> kswapd.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>

Patch with updated context to apply after the fixed 07/17 patch:
----8<----
>From c330b549db9060193672749b48782a7a1d011641 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Thu, 5 May 2016 15:46:25 +0200
Subject: [PATCH v3 10/17] mm, compaction: cleanup unused functions

Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
anymore, so we remove it. The only caller of __compact_pgdat() is
compact_node(), so we merge them and remove code that was only reachable from
kswapd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/compaction.h |  5 ----
 mm/compaction.c            | 60 +++++++++++++---------------------------------
 2 files changed, 17 insertions(+), 48 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index d4e106b5dc27..1bb58581301c 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -70,7 +70,6 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 		unsigned int order, unsigned int alloc_flags,
 		const struct alloc_context *ac, enum compact_priority prio);
-extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
 		unsigned int alloc_flags, int classzone_idx);
@@ -154,10 +153,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline void compact_pgdat(pg_data_t *pgdat, int order)
-{
-}
-
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index e7fe848e318e..5c15db0001a5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1736,10 +1736,18 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 
 /* Compact all zones within a node */
-static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
+static void compact_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	int zoneid;
 	struct zone *zone;
+	struct compact_control cc = {
+		.order = -1,
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+		.whole_zone = true,
+	};
+
 
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
@@ -1747,53 +1755,19 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		if (!populated_zone(zone))
 			continue;
 
-		cc->nr_freepages = 0;
-		cc->nr_migratepages = 0;
-		cc->zone = zone;
-		INIT_LIST_HEAD(&cc->freepages);
-		INIT_LIST_HEAD(&cc->migratepages);
-
-		if (is_via_compact_memory(cc->order) ||
-				!compaction_deferred(zone, cc->order))
-			compact_zone(zone, cc);
-
-		VM_BUG_ON(!list_empty(&cc->freepages));
-		VM_BUG_ON(!list_empty(&cc->migratepages));
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
 
-		if (is_via_compact_memory(cc->order))
-			continue;
+		compact_zone(zone, &cc);
 
-		if (zone_watermark_ok(zone, cc->order,
-				low_wmark_pages(zone), 0, 0))
-			compaction_defer_reset(zone, cc->order, false);
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
 	}
 }
 
-void compact_pgdat(pg_data_t *pgdat, int order)
-{
-	struct compact_control cc = {
-		.order = order,
-		.mode = MIGRATE_ASYNC,
-	};
-
-	if (!order)
-		return;
-
-	__compact_pgdat(pgdat, &cc);
-}
-
-static void compact_node(int nid)
-{
-	struct compact_control cc = {
-		.order = -1,
-		.mode = MIGRATE_SYNC,
-		.ignore_skip_hint = true,
-		.whole_zone = true,
-	};
-
-	__compact_pgdat(NODE_DATA(nid), &cc);
-}
-
 /* Compact all nodes in the system */
 static void compact_nodes(void)
 {
-- 
2.8.4

^ permalink raw reply related	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath
  2016-06-24  9:54 ` [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
@ 2016-06-30 14:44   ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-06-30 14:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel

On Fri 24-06-16 11:54:22, Vlastimil Babka wrote:
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.
> 
> The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS,
> which may change due to TIF_MEMDIE being set on the allocating thread. We can
> fix this, and make the code simpler and a bit more effective at the same time,
> by moving the part that determines ALLOC_NO_WATERMARKS from
> gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
> mask out ALLOC_NO_WATERMARKS in numerous places in __alloc_pages_slowpath()
> anymore. The only two tests for the flag can instead call
> gfp_pfmemalloc_allowed().

As already said in the previous version. Really a nice cleanup!

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 52 ++++++++++++++++++++++++++--------------------------
>  1 file changed, 26 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 89128d64d662..82545274adbe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3193,8 +3193,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	count_vm_event(COMPACTSTALL);
>  
> -	page = get_page_from_freelist(gfp_mask, order,
> -					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>  	if (page) {
>  		struct zone *zone = page_zone(page);
> @@ -3362,8 +3361,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  		return NULL;
>  
>  retry:
> -	page = get_page_from_freelist(gfp_mask, order,
> -					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
> @@ -3421,16 +3419,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
> -	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -		if (gfp_mask & __GFP_MEMALLOC)
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (!in_interrupt() &&
> -				((current->flags & PF_MEMALLOC) ||
> -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -	}
>  #ifdef CONFIG_CMA
>  	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>  		alloc_flags |= ALLOC_CMA;
> @@ -3440,7 +3428,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
> -	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
> +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +		return false;
> +
> +	if (gfp_mask & __GFP_MEMALLOC)
> +		return true;
> +	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> +		return true;
> +	if (!in_interrupt() &&
> +			((current->flags & PF_MEMALLOC) ||
> +			 unlikely(test_thread_flag(TIF_MEMDIE))))
> +		return true;
> +
> +	return false;
>  }
>  
>  static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> @@ -3575,36 +3575,36 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>  
> -retry:
> -	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> -		wake_all_kswapds(order, ac);
> -
>  	/*
> -	 * OK, we're below the kswapd watermark and have kicked background
> -	 * reclaim. Now things get more complex, so set up alloc_flags according
> -	 * to how we want to proceed.
> +	 * The fast path uses conservative alloc_flags to succeed only until
> +	 * kswapd needs to be woken up, and to avoid the cost of setting up
> +	 * alloc_flags precisely. So we do that now.
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> +retry:
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		wake_all_kswapds(order, ac);
> +
>  	/*
>  	 * Reset the zonelist iterators if memory policies can be ignored.
>  	 * These allocations are high priority and system rather than user
>  	 * orientated.
>  	 */
> -	if ((alloc_flags & ALLOC_NO_WATERMARKS) || !(alloc_flags & ALLOC_CPUSET)) {
> +	if (!(alloc_flags & ALLOC_CPUSET) || gfp_pfmemalloc_allowed(gfp_mask)) {
>  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask);
>  	}
>  
>  	/* This is the last chance, in general, before the goto nopage. */
> -	page = get_page_from_freelist(gfp_mask, order,
> -				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto got_pg;
>  
>  	/* Allocate without watermarks if the context allows */
> -	if (alloc_flags & ALLOC_NO_WATERMARKS) {
> +	if (gfp_pfmemalloc_allowed(gfp_mask)) {
> +
>  		page = get_page_from_freelist(gfp_mask, order,
>  						ALLOC_NO_WATERMARKS, ac);
>  		if (page)
> -- 
> 2.8.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt in slowpath
  2016-06-24  9:54 ` [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
@ 2016-06-30 15:03   ` Michal Hocko
  0 siblings, 0 replies; 37+ messages in thread
From: Michal Hocko @ 2016-06-30 15:03 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Mel Gorman, Joonsoo Kim,
	David Rientjes, Rik van Riel

On Fri 24-06-16 11:54:23, Vlastimil Babka wrote:
> After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
> first tries get_page_from_freelist() with the new alloc_flags, as it may
> succeed e.g. due to using min watermark instead of low watermark. It makes
> sense to do this attempt before adjusting the zonelist based on
> alloc_flags/gfp_mask, as it's still a relatively fast path if we just wake up
> kswapd and successfully allocate.
> 
> This patch therefore moves the initial attempt above the retry label and
> reorganizes a bit the part below the retry label. We still have to attempt
> get_page_from_freelist() on each retry, as some allocations cannot do that
> as part of direct reclaim or compaction, and yet are not allowed to fail
> (even though they do a WARN_ON_ONCE() and thus should not exist). We can reuse
> the call meant for ALLOC_NO_WATERMARKS attempt and just set alloc_flags to
> ALLOC_NO_WATERMARKS if the context allows it. As a side-effect, the attempts
> from direct reclaim/compaction will also no longer obey watermarks once this
> is set, but there's little harm in that.
> 
> Kswapd wakeups are also done on each retry to be safe from potential races
> resulting in kswapd going to sleep while a process (that may not be able to
> reclaim by itself) is still looping.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 29 ++++++++++++++++++-----------
>  1 file changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 82545274adbe..06cfa4bb807d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3582,35 +3582,42 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		wake_all_kswapds(order, ac);
> +
> +	/*
> +	 * The adjusted alloc_flags might result in immediate success, so try
> +	 * that first
> +	 */
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
> +	if (page)
> +		goto got_pg;
> +
> +
>  retry:
> +	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
> +	if (gfp_pfmemalloc_allowed(gfp_mask))
> +		alloc_flags = ALLOC_NO_WATERMARKS;
> +
>  	/*
>  	 * Reset the zonelist iterators if memory policies can be ignored.
>  	 * These allocations are high priority and system rather than user
>  	 * orientated.
>  	 */
> -	if (!(alloc_flags & ALLOC_CPUSET) || gfp_pfmemalloc_allowed(gfp_mask)) {
> +	if (!(alloc_flags & ALLOC_CPUSET) || (alloc_flags & ALLOC_NO_WATERMARKS)) {
>  		ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
>  		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
>  					ac->high_zoneidx, ac->nodemask);
>  	}
>  
> -	/* This is the last chance, in general, before the goto nopage. */
> +	/* Attempt with potentially adjusted zonelist and alloc_flags */
>  	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto got_pg;
>  
> -	/* Allocate without watermarks if the context allows */
> -	if (gfp_pfmemalloc_allowed(gfp_mask)) {
> -
> -		page = get_page_from_freelist(gfp_mask, order,
> -						ALLOC_NO_WATERMARKS, ac);
> -		if (page)
> -			goto got_pg;
> -	}
> -
>  	/* Caller is not willing to reclaim, we can't balance anything */
>  	if (!can_direct_reclaim) {
>  		/*
> -- 
> 2.8.4
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-06-24  9:54 ` [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
@ 2016-07-06  5:09   ` Joonsoo Kim
  2016-07-18  9:12     ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-06  5:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Fri, Jun 24, 2016 at 11:54:29AM +0200, Vlastimil Babka wrote:
> A recent patch has added whole_zone flag that compaction sets when scanning
> starts from the zone boundary, in order to report that zone has been fully
> scanned in one attempt. For allocations that want to try really hard or cannot
> fail, we will want to introduce a mode where scanning whole zone is guaranteed
> regardless of the cached positions.
> 
> This patch reuses the whole_zone flag in a way that if it's already passed true
> to compaction, the cached scanner positions are ignored. Employing this flag

Okay. But, please don't reset cached scanner position even if whole_zone
flag is set. Just set cc->migrate_pfn and free_pfn, appropriately. With
your following patches, whole_zone could be set without any compaction
try so there is no point to reset cached scanner position in this
case.
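
Something like this untested sketch of the compact_zone() hunk is what I
have in mind (same boundary values as in your patch, just without writing
them back to the cached positions):

	if (cc->whole_zone) {
		/*
		 * Scan from the zone boundaries, but leave the cached
		 * positions in struct zone untouched.
		 */
		cc->migrate_pfn = start_pfn;
		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
	} else {
		cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = zone->compact_cached_free_pfn;
		if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
			cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
			zone->compact_cached_free_pfn = cc->free_pfn;
		}
		if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
			cc->migrate_pfn = start_pfn;
			zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
			zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
		}
	}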

Thanks.

> during reclaim/compaction loop will be done in the next patch. This patch
> however converts compaction invoked from userspace via procfs to use this flag.
> Before this patch, the cached positions were first reset to zone boundaries and
> then read back from struct zone, so there was a window where a parallel
> compaction could replace the reset values, making the manual compaction less
> effective. Using the flag instead of performing reset is more robust.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/compaction.c | 15 +++++----------
>  mm/internal.h   |  2 +-
>  2 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f825a58bc37c..e7fe848e318e 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1501,11 +1501,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  	 */
>  	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
>  	cc->free_pfn = zone->compact_cached_free_pfn;
> -	if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
> +	if (cc->whole_zone || cc->free_pfn < start_pfn ||
> +						cc->free_pfn >= end_pfn) {
>  		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
>  		zone->compact_cached_free_pfn = cc->free_pfn;
>  	}
> -	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
> +	if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
> +						cc->migrate_pfn >= end_pfn) {
>  		cc->migrate_pfn = start_pfn;
>  		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>  		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> @@ -1751,14 +1753,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
>  		INIT_LIST_HEAD(&cc->freepages);
>  		INIT_LIST_HEAD(&cc->migratepages);
>  
> -		/*
> -		 * When called via /proc/sys/vm/compact_memory
> -		 * this makes sure we compact the whole zone regardless of
> -		 * cached scanner positions.
> -		 */
> -		if (is_via_compact_memory(cc->order))
> -			__reset_isolation_suitable(zone);
> -
>  		if (is_via_compact_memory(cc->order) ||
>  				!compaction_deferred(zone, cc->order))
>  			compact_zone(zone, cc);
> @@ -1794,6 +1788,7 @@ static void compact_node(int nid)
>  		.order = -1,
>  		.mode = MIGRATE_SYNC,
>  		.ignore_skip_hint = true,
> +		.whole_zone = true,
>  	};
>  
>  	__compact_pgdat(NODE_DATA(nid), &cc);
> diff --git a/mm/internal.h b/mm/internal.h
> index 680e5ce2ab37..153bb52335b4 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -179,7 +179,7 @@ struct compact_control {
>  	enum migrate_mode mode;		/* Async or sync migration mode */
>  	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
>  	bool direct_compaction;		/* False from kcompactd or /proc/... */
> -	bool whole_zone;		/* Whole zone has been scanned */
> +	bool whole_zone;		/* Whole zone should/has been scanned */
>  	int order;			/* order a direct compactor needs */
>  	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
>  	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
> -- 
> 2.8.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-06-24  9:54 ` [PATCH v3 12/17] mm, compaction: more reliably increase " Vlastimil Babka
@ 2016-07-06  5:39   ` Joonsoo Kim
  2016-07-15 13:37     ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-06  5:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka wrote:
> During reclaim/compaction loop, compaction priority can be increased by the
> should_compact_retry() function, but the current code is not optimal. Priority
> is only increased when compaction_failed() is true, which means that compaction
> has scanned the whole zone. This may not happen even after multiple attempts
> with the lower priority due to parallel activity, so we might needlessly
> struggle on the lower priority and possibly run out of compaction retry
> attempts in the process.
> 
> We can remove these corner cases by increasing compaction priority regardless
> of compaction_failed(). Examining further the compaction result can be
> postponed only after reaching the highest priority. This is a simple solution
> and we don't need to worry about reaching the highest priority "too soon" here,
> because when should_compact_retry() is called it means that the system is
> already struggling and the allocation is supposed to either try as hard as
> possible, or it cannot fail at all. There's not much point staying at lower
> priorities with heuristics that may result in only partial compaction.
> Also we now count compaction retries only after reaching the highest priority.

I'm not sure that this patch is safe. Deferring and skip-bit in
compaction is highly related to reclaim/compaction. Just ignoring them and (almost)
unconditionally increasing compaction priority will result in less
reclaim and less success rate on compaction. And, necessarily, it
would trigger OOM more frequently.

It would not be your fault. This patch is reasonable in current
situation. It just makes current things more deterministic
although I dislike that current things and this patch would amplify
those problem.

Thanks.

> The only exception here is the COMPACT_SKIPPED result, which means that
> compaction could not run at all due to being below order-0 watermarks. In that
> case, don't increase compaction priority, and check if compaction could proceed
> when everything reclaimable was reclaimed. Before this patch, this was tied to
> compaction_withdrawn(), but the other results considered there are in fact only
> possible due to low compaction priority so we can ignore them thanks to the
> patch. Since there are no other callers of compaction_withdrawn(), change its
> semantics to remove the low priority scenarios.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  include/linux/compaction.h | 28 ++-----------------------
>  mm/page_alloc.c            | 51 ++++++++++++++++++++++++++--------------------
>  2 files changed, 31 insertions(+), 48 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 869b594cf4ff..a6b3d5d2ae53 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -106,8 +106,8 @@ static inline bool compaction_failed(enum compact_result result)
>  }
>  
>  /*
> - * Compaction  has backed off for some reason. It might be throttling or
> - * lock contention. Retrying is still worthwhile.
> + * Compaction has backed off because it cannot proceed until there is enough
> + * free memory. Retrying is still worthwhile after reclaim.
>   */
>  static inline bool compaction_withdrawn(enum compact_result result)
>  {
> @@ -118,30 +118,6 @@ static inline bool compaction_withdrawn(enum compact_result result)
>  	if (result == COMPACT_SKIPPED)
>  		return true;
>  
> -	/*
> -	 * If compaction is deferred for high-order allocations, it is
> -	 * because sync compaction recently failed. If this is the case
> -	 * and the caller requested a THP allocation, we do not want
> -	 * to heavily disrupt the system, so we fail the allocation
> -	 * instead of entering direct reclaim.
> -	 */
> -	if (result == COMPACT_DEFERRED)
> -		return true;
> -
> -	/*
> -	 * If compaction in async mode encounters contention or blocks higher
> -	 * priority task we back off early rather than cause stalls.
> -	 */
> -	if (result == COMPACT_CONTENDED)
> -		return true;
> -
> -	/*
> -	 * Page scanners have met but we haven't scanned full zones so this
> -	 * is a back off in fact.
> -	 */
> -	if (result == COMPACT_PARTIAL_SKIPPED)
> -		return true;
> -
>  	return false;
>  }
>  
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 204cc988fd64..e1efdc8d2a52 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3217,7 +3217,7 @@ static inline bool
>  should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  		     enum compact_result compact_result,
>  		     enum compact_priority *compact_priority,
> -		     int compaction_retries)
> +		     int *compaction_retries)
>  {
>  	int max_retries = MAX_COMPACT_RETRIES;
>  
> @@ -3225,28 +3225,35 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  		return false;
>  
>  	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by insufficient priority
> +	 * Compaction backed off due to watermark checks for order-0
> +	 * so the regular reclaim has to try harder and reclaim something
> +	 * Retry only if it looks like reclaim might have a chance.
>  	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*compact_priority > MIN_COMPACT_PRIORITY) {
> -			(*compact_priority)--;
> -			return true;
> -		}
> -		return false;
> +	if (compaction_withdrawn(compact_result))
> +		return compaction_zonelist_suitable(ac, order, alloc_flags);
> +
> +	/*
> +	 * Compaction could have withdrawn early or skip some zones or
> +	 * pageblocks. We were asked to retry, which means the allocation
> +	 * should try really hard, so increase the priority if possible.
> +	 */
> +	if (*compact_priority > MIN_COMPACT_PRIORITY) {
> +		(*compact_priority)--;
> +		return true;
>  	}
>  
>  	/*
> -	 * make sure the compaction wasn't deferred or didn't bail out early
> -	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> +	 * Compaction considers all the zones as unfixably fragmented and we
> +	 * are on the highest priority, which means it can't be due to
> +	 * heuristics and it doesn't really make much sense to retry.
>  	 */
> -	if (compaction_withdrawn(compact_result))
> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
> +	if (compaction_failed(compact_result))
> +		return false;
>  
>  	/*
> +	 * The remaining possibility is that compaction made progress and
> +	 * created a high-order page, but it was allocated by somebody else.
> +	 * To prevent thrashing, limit the number of retries in such case.
>  	 * !costly requests are much more important than __GFP_REPEAT
>  	 * costly ones because they are de facto nofail and invoke OOM
>  	 * killer to move on while costly can fail and users are ready
> @@ -3254,9 +3261,12 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	 * would need much more detailed feedback from compaction to
>  	 * make a better decision.
>  	 */
> +	if (compaction_made_progress(compact_result))
> +		(*compaction_retries)++;
> +
>  	if (order > PAGE_ALLOC_COSTLY_ORDER)
>  		max_retries /= 4;
> -	if (compaction_retries <= max_retries)
> +	if (*compaction_retries <= max_retries)
>  		return true;
>  
>  	return false;
> @@ -3275,7 +3285,7 @@ static inline bool
>  should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
>  		     enum compact_result compact_result,
>  		     enum compact_priority *compact_priority,
> -		     int compaction_retries)
> +		     int *compaction_retries)
>  {
>  	struct zone *zone;
>  	struct zoneref *z;
> @@ -3672,9 +3682,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (page)
>  		goto got_pg;
>  
> -	if (order && compaction_made_progress(compact_result))
> -		compaction_retries++;
> -
>  	/* Do not loop if specifically requested */
>  	if (gfp_mask & __GFP_NORETRY)
>  		goto nopage;
> @@ -3709,7 +3716,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (did_some_progress > 0 &&
>  			should_compact_retry(ac, order, alloc_flags,
>  				compact_result, &compact_priority,
> -				compaction_retries))
> +				&compaction_retries))
>  		goto retry;
>  
>  	/* Reclaim has failed us, start killing things */
> -- 
> 2.8.4
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success
  2016-06-24  9:54 ` [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success Vlastimil Babka
@ 2016-07-06  5:47   ` Joonsoo Kim
  2016-07-18  9:23     ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-06  5:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Fri, Jun 24, 2016 at 11:54:33AM +0200, Vlastimil Babka wrote:
> The __compact_finished() function uses low watermark in a check that has to
> pass if the direct compaction is to finish and allocation should succeed. This
> is too pessimistic, as the allocation will typically use min watermark. It may
> happen that during compaction, we drop below the low watermark (due to parallel
> activity), but still form the target high-order page. By checking against low
> watermark, we might needlessly continue compaction.
> 
> Similarly, __compaction_suitable() uses low watermark in a check whether
> allocation can succeed without compaction. Again, this is unnecessarily
> pessimistic.
> 
> After this patch, these check will use direct compactor's alloc_flags to
> determine the watermark, which is effectively the min watermark.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/compaction.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 76897850c3c2..371760a85085 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1320,7 +1320,7 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
>  		return COMPACT_CONTINUE;
>  
>  	/* Compaction run is not finished if the watermark is not met */
> -	watermark = low_wmark_pages(zone);
> +	watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];

The finish condition is changed here, but we have two more watermark checks,
in try_to_compact_pages() and kcompactd_do_work(), and they should be
changed too.
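
For both callers I guess it would be the same pattern as the hunk above,
e.g. roughly (untested, and I'm writing the surrounding arguments from
memory, so take it only as a sketch of the idea):

	/* in try_to_compact_pages(), instead of low_wmark_pages(zone): */
	if (zone_watermark_ok(zone, order,
			zone->watermark[alloc_flags & ALLOC_WMARK_MASK],
			ac_classzone_idx(ac), alloc_flags))
		/* allocation should already succeed, stop compacting */
		compaction_defer_reset(zone, order, false);

	/* and similarly in kcompactd_do_work(), with cc.order etc. */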

Thanks.
>  
>  	if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
>  							cc->alloc_flags))
> @@ -1385,7 +1385,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
>  	if (is_via_compact_memory(order))
>  		return COMPACT_CONTINUE;
>  
> -	watermark = low_wmark_pages(zone);
> +	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
>  	/*
>  	 * If watermarks for high-order allocation are already met, there
>  	 * should be no need for compaction at all.
> @@ -1399,7 +1399,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
>  	 * This is because during migration, copies of pages need to be
>  	 * allocated and for a short time, the footprint is higher
>  	 */
> -	watermark += (2UL << order);
> +	watermark = low_wmark_pages(zone) + (2UL << order);
>  	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
>  				 alloc_flags, wmark_target))
>  		return COMPACT_SKIPPED;
> -- 
> 2.8.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable
  2016-06-24  9:54 ` [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable Vlastimil Babka
@ 2016-07-06  5:55   ` Joonsoo Kim
  2016-07-18 11:48     ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-06  5:55 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Fri, Jun 24, 2016 at 11:54:37AM +0200, Vlastimil Babka wrote:
> The compaction_ready() is used during direct reclaim for costly order
> allocations to skip reclaim for zones where compaction should be attempted
> instead. It's combining the standard compaction_suitable() check with its own
> watermark check based on high watermark with extra gap, and the result is
> confusing at best.
> 
> This patch attempts to better structure and document the checks involved.
> First, compaction_suitable() can determine that the allocation should either
> succeed already, or that compaction doesn't have enough free pages to proceed.
> The third possibility is that compaction has enough free pages, but we still
> decide to reclaim first - unless we are already above the high watermark with
> gap.  This does not mean that the reclaim will actually reach this watermark
> during single attempt, this is rather an over-reclaim protection. So document
> the code as such. The check for compaction_deferred() is removed completely, as
> it in fact had no proper role here.
> 
> The result after this patch is mainly less confusing code. We also skip some
> over-reclaim in cases where the allocation should already succeed.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Acked-by: Michal Hocko <mhocko@suse.com>
> ---
>  mm/vmscan.c | 43 ++++++++++++++++++++-----------------------
>  1 file changed, 20 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 484ff05d5a8f..724131661f0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2462,40 +2462,37 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>  }
>  
>  /*
> - * Returns true if compaction should go ahead for a high-order request, or
> - * the high-order allocation would succeed without compaction.
> + * Returns true if compaction should go ahead for a costly-order request, or
> + * the allocation would already succeed without compaction. Return false if we
> + * should reclaim first.
>   */
>  static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
>  {
>  	unsigned long balance_gap, watermark;
> -	bool watermark_ok;
> +	enum compact_result suitable;
> +
> +	suitable = compaction_suitable(zone, order, 0, classzone_idx);
> +	if (suitable == COMPACT_PARTIAL)
> +		/* Allocation should succeed already. Don't reclaim. */
> +		return true;
> +	if (suitable == COMPACT_SKIPPED)
> +		/* Compaction cannot yet proceed. Do reclaim. */
> +		return false;
>  
>  	/*
> -	 * Compaction takes time to run and there are potentially other
> -	 * callers using the pages just freed. Continue reclaiming until
> -	 * there is a buffer of free pages available to give compaction
> -	 * a reasonable chance of completing and allocating the page
> +	 * Compaction is already possible, but it takes time to run and there
> +	 * are potentially other callers using the pages just freed. So proceed
> +	 * with reclaim to make a buffer of free pages available to give
> +	 * compaction a reasonable chance of completing and allocating the page.
> +	 * Note that we won't actually reclaim the whole buffer in one attempt
> +	 * as the target watermark in should_continue_reclaim() is lower. But if
> +	 * we are already above the high+gap watermark, don't reclaim at all.
>  	 */
>  	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
>  			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
>  	watermark = high_wmark_pages(zone) + balance_gap + compact_gap(order);
> -	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);

Hmm... it doesn't explain why both high_wmark_pages and balance_gap
are needed. If we want to make a buffer, one of them would work.

Thanks.

> -
> -	/*
> -	 * If compaction is deferred, reclaim up to a point where
> -	 * compaction will have a chance of success when re-enabled
> -	 */
> -	if (compaction_deferred(zone, order))
> -		return watermark_ok;
> -
> -	/*
> -	 * If compaction is not ready to start and allocation is not likely
> -	 * to succeed without it, then keep reclaiming.
> -	 */
> -	if (compaction_suitable(zone, order, 0, classzone_idx) == COMPACT_SKIPPED)
> -		return false;
>  
> -	return watermark_ok;
> +	return zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
>  }
>  
>  /*
> -- 
> 2.8.4

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-07-06  5:39   ` Joonsoo Kim
@ 2016-07-15 13:37     ` Vlastimil Babka
  2016-07-18  4:41       ` Joonsoo Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-15 13:37 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/06/2016 07:39 AM, Joonsoo Kim wrote:
> On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka wrote:
>> During reclaim/compaction loop, compaction priority can be increased by the
>> should_compact_retry() function, but the current code is not optimal. Priority
>> is only increased when compaction_failed() is true, which means that compaction
>> has scanned the whole zone. This may not happen even after multiple attempts
>> with the lower priority due to parallel activity, so we might needlessly
>> struggle on the lower priority and possibly run out of compaction retry
>> attempts in the process.
>>
>> We can remove these corner cases by increasing compaction priority regardless
>> of compaction_failed(). Examining further the compaction result can be
>> postponed only after reaching the highest priority. This is a simple solution
>> and we don't need to worry about reaching the highest priority "too soon" here,
>> because when should_compact_retry() is called it means that the system is
>> already struggling and the allocation is supposed to either try as hard as
>> possible, or it cannot fail at all. There's not much point staying at lower
>> priorities with heuristics that may result in only partial compaction.
>> Also we now count compaction retries only after reaching the highest priority.
> 
> I'm not sure that this patch is safe. Deferring and skip-bit in
> compaction is highly related to reclaim/compaction. Just ignoring them and (almost)
> unconditionally increasing compaction priority will result in less
> reclaim and less success rate on compaction.

I don't see why less reclaim? Reclaim is always attempted before
compaction and compaction priority doesn't affect it. And as long as
reclaim wants to retry, should_compact_retry() isn't even called, so the
priority stays. I wanted to change that in v1, but Michal suggested I
shouldn't.
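
To illustrate the ordering I mean, a rough user-space model (this is not
the real __alloc_pages_slowpath(); the stubs, names and the retry limit
are simplified stand-ins):

	#include <stdbool.h>
	#include <stdio.h>

	/* stand-in for enum compact_priority: lower value = tries harder */
	enum prio { PRIO_MIN, PRIO_INIT };

	/* stubs that always fail, just to make the loop structure visible */
	static bool direct_reclaim(void)		{ return false; }
	static bool direct_compact(enum prio p)		{ (void)p; return false; }
	static bool should_reclaim_retry(int *loops)	{ return ++(*loops) < 16; }

	static bool should_compact_retry(enum prio *p)
	{
		if (*p > PRIO_MIN) {
			(*p)--;		/* the only place the priority is raised */
			return true;
		}
		return false;
	}

	int main(void)
	{
		enum prio prio = PRIO_INIT;
		int no_progress_loops = 0;

		for (;;) {
			if (direct_reclaim() || direct_compact(prio))
				break;		/* allocation succeeded */
			if (should_reclaim_retry(&no_progress_loops))
				continue;	/* priority untouched while reclaim retries */
			if (should_compact_retry(&prio))
				continue;	/* consulted only once reclaim gives up */
			puts("would go OOM / fail here");
			break;
		}
		return 0;
	}

The priority only enters should_compact_retry(), after the reclaim-side
retry decision has already been made.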

> And, necessarily, it
> would trigger OOM more frequently.

OOM is only allowed for non-costly orders. If reclaim itself doesn't want to
retry for non-costly orders anymore, and we finally start calling
should_compact_retry(), then I guess the system is really struggling
already and eventual OOM wouldn't be premature?

> It would not be your fault. This patch is reasonable in current
> situation. It just makes current things more deterministic
> although I dislike that current things and this patch would amplify
> those problem.
> 
> Thanks.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-07-15 13:37     ` Vlastimil Babka
@ 2016-07-18  4:41       ` Joonsoo Kim
  2016-07-18 12:21         ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-18  4:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Fri, Jul 15, 2016 at 03:37:52PM +0200, Vlastimil Babka wrote:
> On 07/06/2016 07:39 AM, Joonsoo Kim wrote:
> > On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka wrote:
> >> During reclaim/compaction loop, compaction priority can be increased by the
> >> should_compact_retry() function, but the current code is not optimal. Priority
> >> is only increased when compaction_failed() is true, which means that compaction
> >> has scanned the whole zone. This may not happen even after multiple attempts
> >> with the lower priority due to parallel activity, so we might needlessly
> >> struggle on the lower priority and possibly run out of compaction retry
> >> attempts in the process.
> >>
> >> We can remove these corner cases by increasing compaction priority regardless
> >> of compaction_failed(). Examining further the compaction result can be
> >> postponed only after reaching the highest priority. This is a simple solution
> >> and we don't need to worry about reaching the highest priority "too soon" here,
> >> because when should_compact_retry() is called it means that the system is
> >> already struggling and the allocation is supposed to either try as hard as
> >> possible, or it cannot fail at all. There's not much point staying at lower
> >> priorities with heuristics that may result in only partial compaction.
> >> Also we now count compaction retries only after reaching the highest priority.
> > 
> > I'm not sure that this patch is safe. Deferring and skip-bit in
> > compaction is highly related to reclaim/compaction. Just ignoring them and (almost)
> > unconditionally increasing compaction priority will result in less
> > reclaim and less success rate on compaction.
> 
> I don't see why less reclaim? Reclaim is always attempted before
> compaction and compaction priority doesn't affect it. And as long as
> reclaim wants to retry, should_compact_retry() isn't even called, so the
> priority stays. I wanted to change that in v1, but Michal suggested I
> shouldn't.

Assume a situation where there is no !costly high-order free page because
of fragmentation. In this case, should_reclaim_retry() would return false,
since the watermark cannot be met due to the absence of a high-order free
page. Now look at should_compact_retry() with the assumption that there are
enough order-0 free pages. With your patchset, reclaim/compaction is only
retried two times (SYNC_LIGHT and SYNC_FULL), since compaction_withdrawn()
returns false with enough free pages and !COMPACT_SKIPPED.

But before your patchset, COMPACT_PARTIAL_SKIPPED and COMPACT_DEFERRED are
also considered withdrawn, so reclaim/compaction will be retried more times.

As I said before, more reclaim (more free pages) increases the migration
scanner's range and thus the compaction success probability. Therefore your
patchset, which deterministically makes reclaim/compaction retry fewer
times, would not be safe.

> 
> > And, necessarily, it
> > would trigger OOM more frequently.
> 
> OOM is only allowed for costly orders. If reclaim itself doesn't want to
> retry for non-costly orders anymore, and we finally start calling
> should_compact_retry(), then I guess the system is really struggling
> already and eventual OOM wouldn't be premature?

"Premature" is really subjective, so I don't know. Anyway, I tested
your patchset with a simple test case and it causes a regression.

My test setup is:

Mem: 512 MB
vm.compact_unevictable_allowed = 0
Mlocked Mem: 225 MB by using mlock(). With some tricks, mlocked pages are
spread so memory is highly fragmented.

fork 500

This test causes OOM with your patchset but not without your patchset.
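
Roughly, the test program does something like the following (a simplified,
untested sketch: the chunk size, the every-other-page mlock trick and how
long the children stay alive are illustrative choices, not the exact test):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/types.h>
	#include <sys/wait.h>
	#include <unistd.h>

	#define MB		(1024UL * 1024)
	#define CHUNK		(32 * MB)
	#define LOCK_TARGET	(225 * MB)	/* the 225 MB above */
	#define CHILDREN	500

	int main(void)
	{
		long page = sysconf(_SC_PAGESIZE);
		unsigned long locked = 0;
		int i;

		/*
		 * Fault memory in chunk by chunk, mlock every other page and
		 * give the rest back. The locked pages stay behind, roughly
		 * scattered through physical memory, so the remaining free
		 * memory is highly fragmented. Needs RLIMIT_MEMLOCK raised
		 * (or root).
		 */
		while (locked < LOCK_TARGET) {
			char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			unsigned long off;

			if (p == MAP_FAILED)
				break;
			memset(p, 1, CHUNK);
			for (off = 0; off + 2 * page <= CHUNK; off += 2 * page) {
				if (mlock(p + off, page) == 0)
					locked += page;
				madvise(p + off + page, page, MADV_DONTNEED);
			}
		}
		fprintf(stderr, "locked %lu MB\n", locked / MB);

		/*
		 * fork() needs high-order pages for kernel stacks (order-2 on
		 * x86_64), so forking many children keeps hitting !costly
		 * high-order allocations while memory is fragmented.
		 */
		for (i = 0; i < CHILDREN; i++) {
			pid_t pid = fork();

			if (pid == 0) {
				sleep(30);
				_exit(0);
			}
			if (pid < 0) {
				perror("fork");
				break;
			}
		}
		while (wait(NULL) > 0)
			;
		return 0;
	}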

Thanks.

> > It would not be your fault. This patch is reasonable in current
> > situation. It just makes current things more deterministic
> > although I dislike that current things and this patch would amplify
> > those problem.
> > 
> > Thanks.
> > 
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-07-06  5:09   ` Joonsoo Kim
@ 2016-07-18  9:12     ` Vlastimil Babka
  2016-07-19  6:44       ` Joonsoo Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-18  9:12 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/06/2016 07:09 AM, Joonsoo Kim wrote:
> On Fri, Jun 24, 2016 at 11:54:29AM +0200, Vlastimil Babka wrote:
>> A recent patch has added whole_zone flag that compaction sets when scanning
>> starts from the zone boundary, in order to report that zone has been fully
>> scanned in one attempt. For allocations that want to try really hard or cannot
>> fail, we will want to introduce a mode where scanning whole zone is guaranteed
>> regardless of the cached positions.
>>
>> This patch reuses the whole_zone flag in a way that if it's already passed true
>> to compaction, the cached scanner positions are ignored. Employing this flag
>
> Okay. But, please don't reset cached scanner position even if whole_zone
> flag is set. Just set cc->migrate_pfn and free_pfn, appropriately. With

Won't that result in confusion on cached position updates during 
compaction where it checks the previous cached position? I wonder what 
kinds of corner cases it can bring...

> your following patches, whole_zone could be set without any compaction
> try

I don't understand what you mean here. Even after the whole series,
whole_zone is only checked, and the positions thus reset, after passing the
compaction_suitable() call from compact_zone(). So at that point we can
say that compaction is actually being tried and it's not a drive-by reset?

Thanks

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success
  2016-07-06  5:47   ` Joonsoo Kim
@ 2016-07-18  9:23     ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-18  9:23 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/06/2016 07:47 AM, Joonsoo Kim wrote:
> On Fri, Jun 24, 2016 at 11:54:33AM +0200, Vlastimil Babka wrote:
>> The __compact_finished() function uses low watermark in a check that has to
>> pass if the direct compaction is to finish and allocation should succeed. This
>> is too pessimistic, as the allocation will typically use min watermark. It may
>> happen that during compaction, we drop below the low watermark (due to parallel
>> activity), but still form the target high-order page. By checking against low
>> watermark, we might needlessly continue compaction.
>>
>> Similarly, __compaction_suitable() uses low watermark in a check whether
>> allocation can succeed without compaction. Again, this is unnecessarily
>> pessimistic.
>>
>> After this patch, these check will use direct compactor's alloc_flags to
>> determine the watermark, which is effectively the min watermark.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  mm/compaction.c | 6 +++---
>>  1 file changed, 3 insertions(+), 3 deletions(-)
>>
>> diff --git a/mm/compaction.c b/mm/compaction.c
>> index 76897850c3c2..371760a85085 100644
>> --- a/mm/compaction.c
>> +++ b/mm/compaction.c
>> @@ -1320,7 +1320,7 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
>>  		return COMPACT_CONTINUE;
>>
>>  	/* Compaction run is not finished if the watermark is not met */
>> -	watermark = low_wmark_pages(zone);
>> +	watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];
>
> finish condition is changed. We have two more watermark checks in
> try_to_compact_pages() and kcompactd_do_work() and they should be
> changed too.

Ugh, I've completely missed them. Thanks for catching this, hopefully 
fixing that will improve the results.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable
  2016-07-06  5:55   ` Joonsoo Kim
@ 2016-07-18 11:48     ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-18 11:48 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/06/2016 07:55 AM, Joonsoo Kim wrote:
> On Fri, Jun 24, 2016 at 11:54:37AM +0200, Vlastimil Babka wrote:
>> The compaction_ready() is used during direct reclaim for costly order
>> allocations to skip reclaim for zones where compaction should be attempted
>> instead. It's combining the standard compaction_suitable() check with its own
>> watermark check based on high watermark with extra gap, and the result is
>> confusing at best.
>>
>> This patch attempts to better structure and document the checks involved.
>> First, compaction_suitable() can determine that the allocation should either
>> succeed already, or that compaction doesn't have enough free pages to proceed.
>> The third possibility is that compaction has enough free pages, but we still
>> decide to reclaim first - unless we are already above the high watermark with
>> gap.  This does not mean that the reclaim will actually reach this watermark
>> during single attempt, this is rather an over-reclaim protection. So document
>> the code as such. The check for compaction_deferred() is removed completely, as
>> it in fact had no proper role here.
>>
>> The result after this patch is mainly less confusing code. We also skip some
>> over-reclaim in cases where the allocation should already succeed.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> Acked-by: Michal Hocko <mhocko@suse.com>
>> ---
>>  mm/vmscan.c | 43 ++++++++++++++++++++-----------------------
>>  1 file changed, 20 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 484ff05d5a8f..724131661f0c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -2462,40 +2462,37 @@ static bool shrink_zone(struct zone *zone, struct scan_control *sc,
>>  }
>>
>>  /*
>> - * Returns true if compaction should go ahead for a high-order request, or
>> - * the high-order allocation would succeed without compaction.
>> + * Returns true if compaction should go ahead for a costly-order request, or
>> + * the allocation would already succeed without compaction. Return false if we
>> + * should reclaim first.
>>   */
>>  static inline bool compaction_ready(struct zone *zone, int order, int classzone_idx)
>>  {
>>  	unsigned long balance_gap, watermark;
>> -	bool watermark_ok;
>> +	enum compact_result suitable;
>> +
>> +	suitable = compaction_suitable(zone, order, 0, classzone_idx);
>> +	if (suitable == COMPACT_PARTIAL)
>> +		/* Allocation should succeed already. Don't reclaim. */
>> +		return true;
>> +	if (suitable == COMPACT_SKIPPED)
>> +		/* Compaction cannot yet proceed. Do reclaim. */
>> +		return false;
>>
>>  	/*
>> -	 * Compaction takes time to run and there are potentially other
>> -	 * callers using the pages just freed. Continue reclaiming until
>> -	 * there is a buffer of free pages available to give compaction
>> -	 * a reasonable chance of completing and allocating the page
>> +	 * Compaction is already possible, but it takes time to run and there
>> +	 * are potentially other callers using the pages just freed. So proceed
>> +	 * with reclaim to make a buffer of free pages available to give
>> +	 * compaction a reasonable chance of completing and allocating the page.
>> +	 * Note that we won't actually reclaim the whole buffer in one attempt
>> +	 * as the target watermark in should_continue_reclaim() is lower. But if
>> +	 * we are already above the high+gap watermark, don't reclaim at all.
>>  	 */
>>  	balance_gap = min(low_wmark_pages(zone), DIV_ROUND_UP(
>>  			zone->managed_pages, KSWAPD_ZONE_BALANCE_GAP_RATIO));
>>  	watermark = high_wmark_pages(zone) + balance_gap + compact_gap(order);
>> -	watermark_ok = zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);
>
> Hmm... it doesn't explain why both high_wmark_pages and balance_gap
> are needed. If we want to make a buffer, one of them would work.

Mel's series has meanwhile removed KSWAPD_ZONE_BALANCE_GAP_RATIO, so
that should be fine.
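
If I read the state after that removal right, the buffer check would
presumably reduce to something like the sketch below (assuming the
compact_gap() helper from patch 14; this is just the direction, not the
exact resulting code):

	/*
	 * Sketch only: with KSWAPD_ZONE_BALANCE_GAP_RATIO gone, the
	 * over-reclaim buffer is just the high watermark plus the
	 * compaction gap for the requested order.
	 */
	watermark = high_wmark_pages(zone) + compact_gap(order);

	/* Don't reclaim if we are already above high wmark + gap. */
	return zone_watermark_ok_safe(zone, 0, watermark, classzone_idx);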

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-07-18  4:41       ` Joonsoo Kim
@ 2016-07-18 12:21         ` Vlastimil Babka
  2016-07-19  4:53           ` Joonsoo Kim
  0 siblings, 1 reply; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-18 12:21 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/18/2016 06:41 AM, Joonsoo Kim wrote:
> On Fri, Jul 15, 2016 at 03:37:52PM +0200, Vlastimil Babka wrote:
>> On 07/06/2016 07:39 AM, Joonsoo Kim wrote:
>>> On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka wrote:
>>>> During reclaim/compaction loop, compaction priority can be increased by the
>>>> should_compact_retry() function, but the current code is not optimal. Priority
>>>> is only increased when compaction_failed() is true, which means that compaction
>>>> has scanned the whole zone. This may not happen even after multiple attempts
>>>> with the lower priority due to parallel activity, so we might needlessly
>>>> struggle on the lower priority and possibly run out of compaction retry
>>>> attempts in the process.
>>>>
>>>> We can remove these corner cases by increasing compaction priority regardless
>>>> of compaction_failed(). Examining further the compaction result can be
>>>> postponed only after reaching the highest priority. This is a simple solution
>>>> and we don't need to worry about reaching the highest priority "too soon" here,
>>>> because when should_compact_retry() is called it means that the system is
>>>> already struggling and the allocation is supposed to either try as hard as
>>>> possible, or it cannot fail at all. There's not much point staying at lower
>>>> priorities with heuristics that may result in only partial compaction.
>>>> Also we now count compaction retries only after reaching the highest priority.
>>>
>>> I'm not sure that this patch is safe. Deferring and skip-bit in
>>> compaction is highly related to reclaim/compaction. Just ignoring them and (almost)
>>> unconditionally increasing compaction priority will result in less
>>> reclaim and less success rate on compaction.
>>
>> I don't see why less reclaim? Reclaim is always attempted before
>> compaction and compaction priority doesn't affect it. And as long as
>> reclaim wants to retry, should_compact_retry() isn't even called, so the
>> priority stays. I wanted to change that in v1, but Michal suggested I
>> shouldn't.
>
> I assume the situation that there is no !costly highorder freepage
> because of fragmentation. In this case, should_reclaim_retry() would
> return false since watermark cannot be met due to absence of high
> order freepage. Now, please see should_compact_retry() with assumption
> that there are enough order-0 free pages. Reclaim/compaction is only
> retried two times (SYNC_LIGHT and SYNC_FULL) with your patchset since
> compaction_withdrawn() return false with enough freepages and
> !COMPACT_SKIPPED.
>
> But, before your patchset, COMPACT_PARTIAL_SKIPPED and
> COMPACT_DEFERRED is considered as withdrawn so will retry
> reclaim/compaction more times.

Perhaps, but it wouldn't guarantee to reach the highest priority.

> As I said before, more reclaim (more freepage) increase migration
> scanner's scan range and then increase compaction success probability.
> Therefore, your patchset which makes reclaim/compaction retry less times
> deterministically would not be safe.

After the patchset, we are guaranteed a full compaction has happened. If 
that doesn't help, yeah maybe we can try reclaiming more... but where to 
draw the line? Reclaim everything for an order-3 allocation just to 
avoid OOM, ignoring that the system might be thrashing heavily? 
Previously it also wasn't guaranteed to reclaim everything, but what is 
the optimal number of retries?

>>
>>> And, as a necessary consequence, it
>>> would trigger OOM more frequently.
>>
>> OOM is only allowed for costly orders. If reclaim itself doesn't want to
>> retry for non-costly orders anymore, and we finally start calling
>> should_compact_retry(), then I guess the system is really struggling
>> already and eventual OOM wouldn't be premature?
>
> Premature is really subjective so I don't know. Anyway, I tested
> your patchset with simple test case and it causes a regression.
>
> My test setup is:
>
> Mem: 512 MB
> vm.compact_unevictable_allowed = 0
> Mlocked Mem: 225 MB by using mlock(). With some tricks, mlocked pages are
> spread so memory is highly fragmented.

So this testcase isn't really about compaction, as that can't do 
anything even on the full priority. Actually 
compaction_zonelist_suitable() lies to us because it's not really 
suitable. Even with more memory freed by reclaim, it cannot increase the 
chances of compaction (your argument above). Reclaim can only free the 
non-mlocked pages, but compaction can also migrate those.

> fork 500

So the 500 forked processes all wait until the whole forking is done and
only afterwards they all exit? Or do they exit right after fork (or
after some delay)? I would assume the latter, otherwise it would fail
even before my patchset. If the non-mlocked areas don't have enough
highorder pages for all 500 stacks, it will OOM regardless of how many
reclaim and compaction retries happen. But if the processes exit shortly
after fork, the extra retries might help by making time for recycling
the freed stacks of exited processes. But is it a useful workload for
demonstrating the regression then?

> This test causes OOM with your patchset but not without your patchset.
>
> Thanks.
>
>>> It would not be your fault. This patch is reasonable in current
>>> situation. It just makes current things more deterministic
>>> although I dislike that current things and this patch would amplify
>>> those problem.
>>>
>>> Thanks.
>>>
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-07-18 12:21         ` Vlastimil Babka
@ 2016-07-19  4:53           ` Joonsoo Kim
  2016-07-19  7:42             ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-19  4:53 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Mon, Jul 18, 2016 at 02:21:02PM +0200, Vlastimil Babka wrote:
> On 07/18/2016 06:41 AM, Joonsoo Kim wrote:
> >On Fri, Jul 15, 2016 at 03:37:52PM +0200, Vlastimil Babka wrote:
> >>On 07/06/2016 07:39 AM, Joonsoo Kim wrote:
> >>>On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka wrote:
> >>>>During reclaim/compaction loop, compaction priority can be increased by the
> >>>>should_compact_retry() function, but the current code is not optimal. Priority
> >>>>is only increased when compaction_failed() is true, which means that compaction
> >>>>has scanned the whole zone. This may not happen even after multiple attempts
> >>>>with the lower priority due to parallel activity, so we might needlessly
> >>>>struggle on the lower priority and possibly run out of compaction retry
> >>>>attempts in the process.
> >>>>
> >>>>We can remove these corner cases by increasing compaction priority regardless
> >>>>of compaction_failed(). Examining further the compaction result can be
> >>>>postponed only after reaching the highest priority. This is a simple solution
> >>>>and we don't need to worry about reaching the highest priority "too soon" here,
> >>>>because when should_compact_retry() is called it means that the system is
> >>>>already struggling and the allocation is supposed to either try as hard as
> >>>>possible, or it cannot fail at all. There's not much point staying at lower
> >>>>priorities with heuristics that may result in only partial compaction.
> >>>>Also we now count compaction retries only after reaching the highest priority.
> >>>
> >>>I'm not sure that this patch is safe. Deferring and skip-bit in
> >>>compaction is highly related to reclaim/compaction. Just ignoring them and (almost)
> >>>unconditionally increasing compaction priority will result in less
> >>>reclaim and less success rate on compaction.
> >>
> >>I don't see why less reclaim? Reclaim is always attempted before
> >>compaction and compaction priority doesn't affect it. And as long as
> >>reclaim wants to retry, should_compact_retry() isn't even called, so the
> >>priority stays. I wanted to change that in v1, but Michal suggested I
> >>shouldn't.
> >
> >I assume the situation that there is no !costly highorder freepage
> >because of fragmentation. In this case, should_reclaim_retry() would
> >return false since watermark cannot be met due to absence of high
> >order freepage. Now, please see should_compact_retry() with assumption
> >that there are enough order-0 free pages. Reclaim/compaction is only
> >retried two times (SYNC_LIGHT and SYNC_FULL) with your patchset since
> >compaction_withdrawn() return false with enough freepages and
> >!COMPACT_SKIPPED.
> >
> >But, before your patchset, COMPACT_PARTIAL_SKIPPED and
> >COMPACT_DEFERRED is considered as withdrawn so will retry
> >reclaim/compaction more times.
> 
> Perhaps, but it wouldn't guarantee to reach the highest priority.

Yes.

> 
> >As I said before, more reclaim (more freepage) increase migration
> >scanner's scan range and then increase compaction success probability.
> >Therefore, your patchset which makes reclaim/compaction retry less times
> >deterministically would not be safe.
> 
> After the patchset, we are guaranteed a full compaction has
> happened. If that doesn't help, yeah maybe we can try reclaiming
> more... but where to draw the line? Reclaim everything for an

Drawing the line is a difficult problem, I know. As I said before, one
idea is to continue reclaim/compaction until nr_reclaimed reaches the
number of lru pages counted at the beginning of the reclaim/compaction
loop. It would not cause persistent thrashing, I guess.
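
Just to illustrate the idea (a sketch only, with made-up names, not
code taken from anywhere):

	/*
	 * Illustration only: cap the reclaim/compaction loop by the LRU
	 * size sampled when the loop was entered.
	 */
	static bool keep_retrying_reclaim_compaction(unsigned long lru_pages_at_start,
						     unsigned long nr_reclaimed)
	{
		/* Keep retrying until we reclaimed as much as was on the LRUs. */
		return nr_reclaimed < lru_pages_at_start;
	}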

> order-3 allocation just to avoid OOM, ignoring that the system might
> be thrashing heavily? Previously it also wasn't guaranteed to
> reclaim everything, but what is the optimal number of retries?

So, you use the same logic as in the other thread we talked about
yesterday. The fact that it wasn't guaranteed to reclaim everything
before doesn't mean that we can relax the guarantee even more.

I'm not sure the following is relevant to this series, but just a note.

I don't know the optimal number of retries. We are on the way to
finding it and I hope this discussion will help. I don't think that we
can judge the point properly by simply checking stat information at
some moment. That has too limited knowledge about the system, so it
would wrongly advise us to invoke OOM prematurely.

I think that using the compaction result isn't a good way to determine
whether further reclaim/compaction is useless, because the compaction
result can vary with further reclaim/compaction itself.

If we want to check more accurately whether compaction is really
impossible, scanning the whole range and checking the arrangement of
free pages and lru (movable) pages would help more. Although compaction
may still fail even if this check passes, it would give us more
information about the system state and we would invoke OOM less
prematurely. In the case where compaction success is theoretically
possible, we could keep retrying reclaim/compaction even if full
compaction fails, because we have hope that more free pages would give
compaction a higher success probability.
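
To make it concrete, the kind of check I have in mind would look
roughly like this (illustration only, with a made-up helper; it ignores
pfn_valid() checks, zone holes and locking):

	/*
	 * Illustration only: an aligned order-sized block is a
	 * theoretical compaction candidate if every page in it is
	 * either already free or movable (on an LRU), i.e. nothing
	 * pins the block.
	 */
	static bool block_theoretically_compactable(unsigned long start_pfn, int order)
	{
		unsigned long pfn;

		for (pfn = start_pfn; pfn < start_pfn + (1UL << order); pfn++) {
			struct page *page = pfn_to_page(pfn);

			if (!PageBuddy(page) && !PageLRU(page))
				return false;	/* pinned or unmovable page */
		}
		return true;
	}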

> >>
> >>>And, as a necessary consequence, it
> >>>would trigger OOM more frequently.
> >>
> >>OOM is only allowed for costly orders. If reclaim itself doesn't want to
> >>retry for non-costly orders anymore, and we finally start calling
> >>should_compact_retry(), then I guess the system is really struggling
> >>already and eventual OOM wouldn't be premature?
> >
> >Premature is really subjective so I don't know. Anyway, I tested
> >your patchset with simple test case and it causes a regression.
> >
> >My test setup is:
> >
> >Mem: 512 MB
> >vm.compact_unevictable_allowed = 0
> >Mlocked Mem: 225 MB by using mlock(). With some tricks, mlocked pages are
> >spread so memory is highly fragmented.
> 
> So this testcase isn't really about compaction, as that can't do
> anything even on the full priority. Actually

I missed that there are two parallel file readers. So, reclaim/compaction
actually can do something.

> compaction_zonelist_suitable() lies to us because it's not really
> suitable. Even with more memory freed by reclaim, it cannot increase
> the chances of compaction (your argument above). Reclaim can only
> free the non-mlocked pages, but compaction can also migrate those.
> 
> >fork 500
> 
> So the 500 forked processes all wait until the whole forking is done

Note that 500 isn't a static value. The fragmentation ratio varies a lot
in every attempt, so I have to find a proper value on each run.

Here is the way to find a proper value.

1. make the system fragmented
2. run two file readers in the background.
3. set vm.highorder_retry = 1, which is my custom change to retry
reclaim/compaction endlessly for highorder allocations.
4. ./fork N
5. find a proper N that doesn't invoke OOM.
6. set vm.highorder_retry = 0 to test your patchset.
7. ./fork N


js1304@ubuntu:~$ sudo sysctl -w vm.highorder_retry=1
vm.highorder_retry = 1
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m0.348s
user    0m0.000s
sys     0m0.252s
0
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m1.175s
user    0m0.000s
sys     0m0.576s
0
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m0.044s
user    0m0.000s
sys     0m0.036s
0
js1304@ubuntu:~$ sudo sysctl -w vm.highorder_retry=0
vm.highorder_retry = 0
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m0.470s
user    0m0.000s
sys     0m0.427s
18
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m0.710s
user    0m0.000s
sys     0m0.589s
14
js1304@ubuntu:~$ time ./fork 300 0; sudo dmesg -c | grep -i -e order -e killed > tmp.dat; grep -e Killed tmp.dat | wc -l; grep -v fork tmp.dat
real    0m0.944s
user    0m0.000s
sys     0m0.668s
27

A positive number on the last line means that there were OOM-killed
processes during the test.

> and only afterwards they all exit? Or they exit right after fork (or
> some delay?) I would assume the latter otherwise it would fail even

All forked processes wait until the whole forking is done.

> before my patchset. If the non-mlocked areas don't have enough
> highorder pages for all 500 stacks, it will OOM regardless of how
> many reclaim and compaction retries. But if the processes exit
> shortly after fork, the extra retries might help making time for
> recycling the freed stacks of exited processes. But is it an useful
> workload for demonstrating the regression then?

I think so. This testcase greatly pressures reclaim/compaction for
high order allocations because system memory is fragmented, but there
is a lot of reclaimable memory and there are many high order freepage
candidates.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-07-18  9:12     ` Vlastimil Babka
@ 2016-07-19  6:44       ` Joonsoo Kim
  2016-07-19  6:54         ` Vlastimil Babka
  0 siblings, 1 reply; 37+ messages in thread
From: Joonsoo Kim @ 2016-07-19  6:44 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On Mon, Jul 18, 2016 at 11:12:51AM +0200, Vlastimil Babka wrote:
> On 07/06/2016 07:09 AM, Joonsoo Kim wrote:
> >On Fri, Jun 24, 2016 at 11:54:29AM +0200, Vlastimil Babka wrote:
> >>A recent patch has added whole_zone flag that compaction sets when scanning
> >>starts from the zone boundary, in order to report that zone has been fully
> >>scanned in one attempt. For allocations that want to try really hard or cannot
> >>fail, we will want to introduce a mode where scanning whole zone is guaranteed
> >>regardless of the cached positions.
> >>
> >>This patch reuses the whole_zone flag in a way that if it's already passed true
> >>to compaction, the cached scanner positions are ignored. Employing this flag
> >
> >Okay. But, please don't reset cached scanner position even if whole_zone
> >flag is set. Just set cc->migrate_pfn and free_pfn, appropriately. With
> 
> Won't that result in confusion on cached position updates during
> compaction where it checks the previous cached position? I wonder
> what kinds of corner cases it can bring...

whole_zone would come along with ignore_skip_hint, so I think that
there is no problem with cached position updates.

> 
> >your following patches, whole_zone could be set without any compaction
> >try
> 
> I don't understand what you mean here? Even after whole series,
> whole_zone is only checked, and positions thus reset, after passing
> the compaction_suitable() call from compact_zone(). So at that point
> we can say that compaction is being actually tried and it's not a
> drive-by reset?

My point is that we should not reset the zone's cached pfns in the
whole_zone case, because all that compaction with COMPACT_PRIO_SYNC_FULL
wants is to scan the whole range. The zone's cached pfns exist for
efficiency and there is no reason for compaction with
COMPACT_PRIO_SYNC_FULL to reset them. If there are some parallel
compaction users, they could benefit from the untouched cached pfns, so
I'd like to leave them alone.

Thanks.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-07-19  6:44       ` Joonsoo Kim
@ 2016-07-19  6:54         ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-19  6:54 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/19/2016 08:44 AM, Joonsoo Kim wrote:
> On Mon, Jul 18, 2016 at 11:12:51AM +0200, Vlastimil Babka wrote:
>> On 07/06/2016 07:09 AM, Joonsoo Kim wrote:
>>> On Fri, Jun 24, 2016 at 11:54:29AM +0200, Vlastimil Babka wrote:
>>>> A recent patch has added whole_zone flag that compaction sets when scanning
>>>> starts from the zone boundary, in order to report that zone has been fully
>>>> scanned in one attempt. For allocations that want to try really hard or cannot
>>>> fail, we will want to introduce a mode where scanning whole zone is guaranteed
>>>> regardless of the cached positions.
>>>>
>>>> This patch reuses the whole_zone flag in a way that if it's already passed true
>>>> to compaction, the cached scanner positions are ignored. Employing this flag
>>>
>>> Okay. But, please don't reset cached scanner position even if whole_zone
>>> flag is set. Just set cc->migrate_pfn and free_pfn, appropriately. With
>>
>> Won't that result in confusion on cached position updates during
>> compaction where it checks the previous cached position? I wonder
>> what kinds of corner cases it can bring...
>
> whole_zone would come along with ignore_skip_hint so I think that
> there is no problem on cached position updating.

Right, that's true.

>>
>>> your following patches, whole_zone could be set without any compaction
>>> try
>>
>> I don't understand what you mean here? Even after whole series,
>> whole_zone is only checked, and positions thus reset, after passing
>> the compaction_suitable() call from compact_zone(). So at that point
>> we can say that compaction is being actually tried and it's not a
>> drive-by reset?
>
> My point is that we should not initialize zone's cached pfn in case of
> the whole_zone because what compaction with COMPACT_PRIO_SYNC_FULL
> want is just to scan whole range. zone's cached pfn exists for
> efficiency and there is no reason to initialize it by compaction with
> COMPACT_PRIO_SYNC_FULL. If there are some parallel compaction users,
> they could be benefit from un-initialized zone's cached pfn so I'd
> like to leave them.

I doubt they will benefit much, but OK, I'll update the patch.
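
Probably something along these lines, i.e. just a sketch of the
direction, not the final patch (pfn rounding details and the usual
sanity checks omitted):

	/*
	 * Sketch only: in compact_zone(), when cc->whole_zone is set,
	 * start the scanners at the zone boundaries but leave the
	 * zone's cached positions (zone->compact_cached_*) untouched
	 * for other compactors.
	 */
	const bool sync = cc->mode != MIGRATE_ASYNC;

	if (cc->whole_zone) {
		cc->migrate_pfn = zone->zone_start_pfn;
		cc->free_pfn = round_down(zone_end_pfn(zone) - 1,
					  pageblock_nr_pages);
	} else {
		cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
		cc->free_pfn = zone->compact_cached_free_pfn;
	}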

> Thanks.
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 12/17] mm, compaction: more reliably increase direct compaction priority
  2016-07-19  4:53           ` Joonsoo Kim
@ 2016-07-19  7:42             ` Vlastimil Babka
  0 siblings, 0 replies; 37+ messages in thread
From: Vlastimil Babka @ 2016-07-19  7:42 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, linux-kernel, linux-mm, Michal Hocko, Mel Gorman,
	David Rientjes, Rik van Riel

On 07/19/2016 06:53 AM, Joonsoo Kim wrote:
> On Mon, Jul 18, 2016 at 02:21:02PM +0200, Vlastimil Babka wrote:
>> On 07/18/2016 06:41 AM, Joonsoo Kim wrote:
>>> On Fri, Jul 15, 2016 at 03:37:52PM +0200, Vlastimil Babka wrote:
>>>> On 07/06/2016 07:39 AM, Joonsoo Kim wrote:
>>>>> On Fri, Jun 24, 2016 at 11:54:32AM +0200, Vlastimil Babka
>>>>> wrote:
>>>>>> During reclaim/compaction loop, compaction priority can be
>>>>>> increased by the should_compact_retry() function, but the
>>>>>> current code is not optimal. Priority is only increased
>>>>>> when compaction_failed() is true, which means that
>>>>>> compaction has scanned the whole zone. This may not happen
>>>>>> even after multiple attempts with the lower priority due to
>>>>>> parallel activity, so we might needlessly struggle on the
>>>>>> lower priority and possibly run out of compaction retry 
>>>>>> attempts in the process.
>>>>>> 
>>>>>> We can remove these corner cases by increasing compaction
>>>>>> priority regardless of compaction_failed(). Examining
>>>>>> further the compaction result can be postponed only after
>>>>>> reaching the highest priority. This is a simple solution 
>>>>>> and we don't need to worry about reaching the highest
>>>>>> priority "too soon" here, because hen
>>>>>> should_compact_retry() is called it means that the system
>>>>>> is already struggling and the allocation is supposed to
>>>>>> either try as hard as possible, or it cannot fail at all.
>>>>>> There's not much point staying at lower priorities with
>>>>>> heuristics that may result in only partial compaction. Also
>>>>>> we now count compaction retries only after reaching the
>>>>>> highest priority.
>>>>> 
>>>>> I'm not sure that this patch is safe. Deferring and skip-bit
>>>>> in compaction is highly related to reclaim/compaction. Just
>>>>> ignoring them and (almost) unconditionally increasing
>>>>> compaction priority will result in less reclaim and less
>>>>> success rate on compaction.
>>>> 
>>>> I don't see why less reclaim? Reclaim is always attempted
>>>> before compaction and compaction priority doesn't affect it.
>>>> And as long as reclaim wants to retry, should_compact_retry()
>>>> isn't even called, so the priority stays. I wanted to change
>>>> that in v1, but Michal suggested I shouldn't.
>>> 
>>> I assume the situation that there is no !costly highorder
>>> freepage because of fragmentation. In this case,
>>> should_reclaim_retry() would return false since watermark cannot
>>> be met due to absence of high order freepage. Now, please see
>>> should_compact_retry() with assumption that there are enough
>>> order-0 free pages. Reclaim/compaction is only retried two times
>>> (SYNC_LIGHT and SYNC_FULL) with your patchset since 
>>> compaction_withdrawn() return false with enough freepages and 
>>> !COMPACT_SKIPPED.
>>> 
>>> But, before your patchset, COMPACT_PARTIAL_SKIPPED and 
>>> COMPACT_DEFERRED is considered as withdrawn so will retry 
>>> reclaim/compaction more times.
>> 
>> Perhaps, but it wouldn't guarantee to reach the highest priority.
> 
> Yes.

Since this is my greatest concern here, would the alternative patch at
the end of the mail work for you? Trying your test would be nice too,
but that can also wait until I repost the whole series (the missed
watermark checks you spotted in patch 13 could also play a role there).

> 
>> order-3 allocation just to avoid OOM, ignoring that the system
>> might be thrashing heavily? Previously it also wasn't guaranteed
>> to reclaim everything, but what is the optimal number of retries?
> 
> So, you say the similar logic in other thread we talked yesterday. 
> The fact that it wasn't guaranteed to reclaim every thing before 
> doesn't mean that we could relax guarantee more.
> 
> I'm not sure below is relevant to this series but just note.
> 
> I don't know the optimal number of retries. We are in a way to find 
> it and I hope this discussion would help. I don't think that we can 
> judge the point properly with simple checking on stat information at
> some moment. It only has too limited knowledge about the system so it
> would wrongly advise us to invoke OOM prematurely.
> 
> I think that using compaction result isn't a good way to determine
> if further reclaim/compaction is useless or not because compaction
> result can vary with further reclaim/compaction itself.

If we scan the whole zone ignoring all the heuristics, and still fail,
I think the result is pretty reliable (ignoring parallel activity,
because with that we can indeed never be sure).

> If we want to check more accurately if compaction is really
> impossible, scanning whole range and checking arrangement of freepage
> and lru(movable) pages would more help.

But the whole-zone compaction just did exactly this and failed? Sure, we
might have missed something due to the way the compaction scanners meet
around the middle of the zone, but that's a reason to improve the
algorithm, not to attempt more reclaim based on checks that duplicate
the scanning work.

> Although there is some possibility to fail the compaction even if 
> this check is passed, it would give us more information about the 
> system state and we would invoke OOM less prematurely. In this case 
> that theoretically compaction success is possible, we could keep 
> reclaim/compaction more times even if full compaction fails because 
> we have a hope that more freepages would give us more compaction 
> success probability.

More free pages can only increase the probability because of a) more
resilience against parallel memory allocations getting us below the low
order-0 watermark during our compaction, and b) an increased chance of
the migrate scanner reaching higher pfns in the zone if there is
unmovable fragmentation in the lower pfns. Both are problems to
potentially solve, and I think further tuning the decisions for
reclaim/compaction retries is just a bad workaround, and definitely not
something I would like to do in this series. So I'll try to avoid
decreasing the number of retries in the patch below, but not more:

-----8<-----
From a942ff54f7aeb2cb9cca9b868b3dde6cac90e924 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 19 Jul 2016 09:26:06 +0200
Subject: [PATCH] mm, compaction: more reliably increase direct compaction
 priority

During reclaim/compaction loop, compaction priority can be increased by the
should_compact_retry() function, but the current code is not optimal. Priority
is only increased when compaction_failed() is true, which means that compaction
has scanned the whole zone. This may not happen even after multiple attempts
with the lower priority due to parallel activity, so we might needlessly
struggle on the lower priority and possibly run out of compaction retry
attempts in the process.

After this patch we are guaranteed at least one attempt at the highest
compaction priority even if we exhaust all retries at the lower priorities.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bb9b4fb66e85..aa2580a1bcf9 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3155,13 +3155,8 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	 * so it doesn't really make much sense to retry except when the
 	 * failure could be caused by insufficient priority
 	 */
-	if (compaction_failed(compact_result)) {
-		if (*compact_priority > MIN_COMPACT_PRIORITY) {
-			(*compact_priority)--;
-			return true;
-		}
-		return false;
-	}
+	if (compaction_failed(compact_result))
+		goto check_priority;
 
 	/*
 	 * make sure the compaction wasn't deferred or didn't bail out early
@@ -3185,6 +3180,15 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	if (compaction_retries <= max_retries)
 		return true;
 
+	/* 
+	 * Make sure there is at least one attempt at the highest priority
+	 * if we exhausted all retries at the lower priorities
+	 */
+check_priority:
+	if (*compact_priority > MIN_COMPACT_PRIORITY) {
+		(*compact_priority)--;
+		return true;
+	}
 	return false;
 }
 #else
-- 
2.9.0

^ permalink raw reply related	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2016-07-19  7:42 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-24  9:54 [PATCH v3 00/17] make direct compaction more deterministic Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 01/17] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 02/17] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
2016-06-30 14:44   ` Michal Hocko
2016-06-24  9:54 ` [PATCH v3 03/17] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
2016-06-30 15:03   ` Michal Hocko
2016-06-24  9:54 ` [PATCH v3 04/17] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 05/17] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 06/17] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 07/17] mm, compaction: introduce direct compaction priority Vlastimil Babka
2016-06-24 11:39   ` kbuild test robot
2016-06-24 11:51     ` Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 08/17] mm, compaction: simplify contended compaction handling Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 09/17] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
2016-07-06  5:09   ` Joonsoo Kim
2016-07-18  9:12     ` Vlastimil Babka
2016-07-19  6:44       ` Joonsoo Kim
2016-07-19  6:54         ` Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 10/17] mm, compaction: cleanup unused functions Vlastimil Babka
2016-06-24 11:53   ` Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 11/17] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 12/17] mm, compaction: more reliably increase " Vlastimil Babka
2016-07-06  5:39   ` Joonsoo Kim
2016-07-15 13:37     ` Vlastimil Babka
2016-07-18  4:41       ` Joonsoo Kim
2016-07-18 12:21         ` Vlastimil Babka
2016-07-19  4:53           ` Joonsoo Kim
2016-07-19  7:42             ` Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 13/17] mm, compaction: use correct watermark when checking allocation success Vlastimil Babka
2016-07-06  5:47   ` Joonsoo Kim
2016-07-18  9:23     ` Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 14/17] mm, compaction: create compact_gap wrapper Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 15/17] mm, compaction: use proper alloc_flags in __compaction_suitable() Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 16/17] mm, compaction: require only min watermarks for non-costly orders Vlastimil Babka
2016-06-24  9:54 ` [PATCH v3 17/17] mm, vmscan: make compaction_ready() more accurate and readable Vlastimil Babka
2016-07-06  5:55   ` Joonsoo Kim
2016-07-18 11:48     ` Vlastimil Babka
