* [RFC 00/13] make direct compaction more deterministic
@ 2016-05-10  7:35 Vlastimil Babka
  2016-05-10  7:35 ` [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
                   ` (13 more replies)
  0 siblings, 14 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

This is mostly a followup to Michal's oom detection rework, which highlighted
the need for direct compaction to provide better feedback in the reclaim/compaction
loop, so that it can reliably recognize when compaction cannot make further
progress, and the allocation should invoke the OOM killer or fail. We've discussed
this at LSF/MM [1] where I proposed expanding the async/sync migration mode
used in compaction to more general "priorities". This patchset adds one new
priority that just overrides all the heuristics and makes compaction fully
scan all zones. I don't currently think that we need more fine-grained
priorities, but we'll see. Other than that, there are some smaller fixes and
cleanups, mainly related to the THP-specific hacks.

Testing/evaluation is pending, but I'm posting it now in the hope of helping the
discussions around the oom detection rework, and of getting the near-OOM
conditions tested. The new priority level should also help hugetlbfs allocations,
since they use __GFP_REPEAT, and it has already been reported that ignoring
compaction heuristics helps these allocations.

The series is based on mmotm git [2] tag mmotm-2016-04-27-15-21-14.
One first needs to git revert commit 69340d225e8d ("mm: use compaction
feedback for thp backoff conditions"), which has already happened in mmots.

[1] https://lwn.net/Articles/684611/
[2] git://git.kernel.org/pub/scm/linux/kernel/git/mhocko/mm.git

Hugh Dickins (1):
  mm, compaction: don't isolate PageWriteback pages in
    MIGRATE_SYNC_LIGHT mode

Vlastimil Babka (12):
  mm, page_alloc: set alloc_flags only once in slowpath
  mm, page_alloc: don't retry initial attempt in slowpath
  mm, page_alloc: restructure direct compaction handling in slowpath
  mm, page_alloc: make THP-specific decisions more generic
  mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  mm, compaction: introduce direct compaction priority
  mm, compaction: simplify contended compaction handling
  mm, compaction: make whole_zone flag ignore cached scanner positions
  mm, compaction: cleanup unused functions
  mm, compaction: add the ultimate direct compaction priority
  mm, compaction: more reliably increase direct compaction priority
  mm, compaction: fix and improve watermark handling

 include/linux/compaction.h |  32 +++----
 include/linux/gfp.h        |   3 +-
 include/linux/mm.h         |   2 +-
 mm/compaction.c            | 196 +++++++++++++++-------------------------
 mm/huge_memory.c           |   8 +-
 mm/internal.h              |  10 +--
 mm/page_alloc.c            | 220 +++++++++++++++++++++------------------------
 mm/page_isolation.c        |   2 +-
 8 files changed, 196 insertions(+), 277 deletions(-)

-- 
2.8.2


* [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-11 12:40   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
                   ` (12 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Hugh Dickins, Vlastimil Babka

From: Hugh Dickins <hughd@google.com>

At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
isolate a PageWriteback page, which __unmap_and_move() then rejects
with -EBUSY: of course the writeback might complete in between, but
that's not what we usually expect, so probably better not to isolate it.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index c72987603343..481004c73c90 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1146,7 +1146,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
 	struct page *page;
 	const isolate_mode_t isolate_mode =
 		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
-		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
+		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
 
 	/*
 	 * Start at where we last stopped, or beginning of the zone as
-- 
2.8.2


* [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
  2016-05-10  7:35 ` [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-10 11:28   ` Tetsuo Handa
  2016-05-10  7:35 ` [RFC 03/13] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
                   ` (11 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
so move the initialization above the retry: label. Also make the comment above
the initialization more descriptive.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a50184ec6ca0..91fbf6f95403 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3579,17 +3579,17 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
-retry:
-	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
-		wake_all_kswapds(order, ac);
-
 	/*
-	 * OK, we're below the kswapd watermark and have kicked background
-	 * reclaim. Now things get more complex, so set up alloc_flags according
-	 * to how we want to proceed.
+	 * The fast path uses conservative alloc_flags to succeed only until
+	 * kswapd needs to be woken up, and to avoid the cost of setting up
+	 * alloc_flags precisely. So we do that now.
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+retry:
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		wake_all_kswapds(order, ac);
+
 	/* This is the last chance, in general, before the goto nopage. */
 	page = get_page_from_freelist(gfp_mask, order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
-- 
2.8.2


* [RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
  2016-05-10  7:35 ` [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
  2016-05-10  7:35 ` [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-12 12:48   ` Michal Hocko
  2016-05-31  6:25   ` Joonsoo Kim
  2016-05-10  7:35 ` [RFC 04/13] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
                   ` (10 subsequent siblings)
  13 siblings, 2 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
first tries get_page_from_freelist() with the new alloc_flags, as it may
succeed e.g. due to using min watermark instead of low watermark. This attempt
does not have to be retried on each loop, since direct reclaim, direct
compaction and oom call get_page_from_freelist() themselves.

This patch therefore moves the initial attempt above the retry label. The
ALLOC_NO_WATERMARKS attempt is kept under the retry label, as it's special and
should be retried on each loop. Kswapd wakeups are also done on each retry
to be safe from potential races resulting in kswapd going to sleep while a
process (that may not be able to reclaim by itself) is still looping.
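
In condensed form, the resulting order of operations is the following (a
sketch of the end result, not the patch itself; names as in the diff below):

        alloc_flags = gfp_to_alloc_flags(gfp_mask);

        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                wake_all_kswapds(order, ac);

        /* single initial attempt with the adjusted alloc_flags */
        page = get_page_from_freelist(gfp_mask, order,
                                alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
        if (page)
                goto got_pg;

retry:
        /* repeated so kswapd cannot go back to sleep while we keep looping */
        if (gfp_mask & __GFP_KSWAPD_RECLAIM)
                wake_all_kswapds(order, ac);

        /* ALLOC_NO_WATERMARKS attempt, reclaim, compaction, OOM, ... */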

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 91fbf6f95403..7249949d65ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3586,16 +3586,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
-retry:
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
 		wake_all_kswapds(order, ac);
 
-	/* This is the last chance, in general, before the goto nopage. */
+	/*
+	 * The adjusted alloc_flags might result in immediate success, so try
+	 * that first
+	 */
 	page = get_page_from_freelist(gfp_mask, order,
 				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
 	if (page)
 		goto got_pg;
 
+retry:
+	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		wake_all_kswapds(order, ac);
+
 	/* Allocate without watermarks if the context allows */
 	if (alloc_flags & ALLOC_NO_WATERMARKS) {
 		/*
-- 
2.8.2


* [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (2 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 03/13] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-12 13:29   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
                   ` (9 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim
and compaction (and OOM), until either the allocation succeeds, or it returns
with failure. Success here is more probable when reclaim precedes compaction,
as certain watermarks have to be met for compaction to even try, and more free
pages increase the probability of compaction success. On the other hand,
starting with light async compaction (if the watermarks allow it) can be
more efficient, especially for smaller orders, if there's enough free memory
which is just fragmented.

Thus, the current code starts with compaction before reclaim, and to make sure
that the last reclaim is always followed by a final compaction, there's another
direct compaction call at the end of the loop. This makes the code hard to
follow and adds some duplicated handling of migration_mode decisions. It's also
somewhat inefficient that even if reclaim or compaction decides not to retry,
the final compaction is still attempted. Some gfp flag combinations also
shortcut these retry decisions with "goto noretry;", making the code even
harder to follow.

This patch attempts to restructure the code with only minimal functional
changes. The call to the first compaction and THP-specific checks are now
placed above the retry loop, and the "noretry" direct compaction is removed.

The initial compaction is additionally restricted only to costly orders, as we
can expect smaller orders to be held back by watermarks, and only larger orders
to suffer primarily from fragmentation. This better matches the checks in
reclaim's shrink_zones().

There are two other smaller functional changes. One is that the upgrade from
async migration to light sync migration will always occur after the initial
compaction. This is how it was until the recent patch "mm, oom: protect
!costly allocations some more", which introduced upgrading the mode based on
the COMPACT_COMPLETE result, but kept the final compaction always upgraded,
which made it even more special. It's better to return to the simpler handling for
now, as migration modes will be further modified later in the series.

The second change is that once both reclaim and compaction declare it's not
worth retrying the reclaim/compact loop, there is no final compaction attempt.
As argued above, this is intentional. If that final compaction were to succeed,
it would be due to a wrong retry decision, or simply a race with somebody else
freeing memory for us.

The main outcome of this patch should be simpler code. Logically, the initial
compaction without reclaim is the exceptional case to the reclaim/compaction
scheme, but prior to the patch, it was the last loop iteration that was
exceptional. Now the code matches the logic better. The change also enables the
following patches.
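
To make the new shape easier to follow, the resulting structure of
__alloc_pages_slowpath() boils down to roughly this (heavily condensed
sketch, not the actual code):

        alloc_flags = gfp_to_alloc_flags(gfp_mask);
        /* wake kswapd, initial get_page_from_freelist() attempt, ... */

        /* one-off async compaction, for costly orders only */
        if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
                page = __alloc_pages_direct_compact(..., MIGRATE_ASYNC, ...);
                if (page)
                        goto got_pg;
                if (is_thp_gfp_mask(gfp_mask)) {
                        /* COMPACT_DEFERRED or COMPACT_CONTENDED -> goto nopage */
                        /* page faults (not khugepaged) keep MIGRATE_ASYNC */
                }
        }

retry:
        /* kswapd wakeup, ALLOC_NO_WATERMARKS attempt, OOM-victim checks, ... */

        page = __alloc_pages_direct_reclaim(...);       /* reclaim first */
        if (page)
                goto got_pg;

        /* then compaction, MIGRATE_SYNC_LIGHT by default now */
        page = __alloc_pages_direct_compact(..., migration_mode, ...);
        if (page)
                goto got_pg;

        /* __GFP_NORETRY and costly !__GFP_REPEAT allocations: goto nopage */
        /* otherwise reclaim/compaction retry decisions may goto retry, */
        /* with __alloc_pages_may_oom() as the last resort */

nopage:
        warn_alloc_failed(gfp_mask, order, NULL);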

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 107 +++++++++++++++++++++++++++++---------------------------
 1 file changed, 55 insertions(+), 52 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 7249949d65ca..88d680b3e7b6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3555,7 +3555,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum migrate_mode migration_mode = MIGRATE_ASYNC;
+	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
@@ -3598,6 +3598,50 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (page)
 		goto got_pg;
 
+	/*
+	 * For costly allocations, try direct compaction first, as it's likely
+	 * that we have enough base pages and don't need to reclaim.
+	 */
+	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
+		page = __alloc_pages_direct_compact(gfp_mask, order,
+						alloc_flags, ac,
+						MIGRATE_ASYNC,
+						&compact_result);
+		if (page)
+			goto got_pg;
+
+		/* Checks for THP-specific high-order allocations */
+		if (is_thp_gfp_mask(gfp_mask)) {
+			/*
+			 * If compaction is deferred for high-order allocations,
+			 * it is because sync compaction recently failed. If
+			 * this is the case and the caller requested a THP
+			 * allocation, we do not want to heavily disrupt the
+			 * system, so we fail the allocation instead of entering
+			 * direct reclaim.
+			 */
+			if (compact_result == COMPACT_DEFERRED)
+				goto nopage;
+
+			/*
+			 * Compaction is contended so rather back off than cause
+			 * excessive stalls.
+			 */
+			if (compact_result == COMPACT_CONTENDED)
+				goto nopage;
+
+			/*
+			 * It can become very expensive to allocate transparent
+			 * hugepages at fault, so use asynchronous memory
+			 * compaction for THP unless it is khugepaged trying to
+			 * collapse. All other requests should tolerate at
+			 * least light sync migration.
+			 */
+			if (!(current->flags & PF_KTHREAD))
+				migration_mode = MIGRATE_ASYNC;
+		}
+	}
+
 retry:
 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
 	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
@@ -3646,55 +3690,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
-	/*
-	 * Try direct compaction. The first pass is asynchronous. Subsequent
-	 * attempts after direct reclaim are synchronous
-	 */
+
+	/* Try direct reclaim and then allocating */
+	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
+							&did_some_progress);
+	if (page)
+		goto got_pg;
+
+	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
 					migration_mode,
 					&compact_result);
 	if (page)
 		goto got_pg;
 
-	/* Checks for THP-specific high-order allocations */
-	if (is_thp_gfp_mask(gfp_mask)) {
-		/*
-		 * If compaction is deferred for high-order allocations, it is
-		 * because sync compaction recently failed. If this is the case
-		 * and the caller requested a THP allocation, we do not want
-		 * to heavily disrupt the system, so we fail the allocation
-		 * instead of entering direct reclaim.
-		 */
-		if (compact_result == COMPACT_DEFERRED)
-			goto nopage;
-
-		/*
-		 * Compaction is contended so rather back off than cause
-		 * excessive stalls.
-		 */
-		if(compact_result == COMPACT_CONTENDED)
-			goto nopage;
-	}
-
 	if (order && compaction_made_progress(compact_result))
 		compaction_retries++;
 
-	/* Try direct reclaim and then allocating */
-	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
-							&did_some_progress);
-	if (page)
-		goto got_pg;
-
 	/* Do not loop if specifically requested */
 	if (gfp_mask & __GFP_NORETRY)
-		goto noretry;
+		goto nopage;
 
 	/*
 	 * Do not retry costly high order allocations unless they are
 	 * __GFP_REPEAT
 	 */
 	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
-		goto noretry;
+		goto nopage;
 
 	/*
 	 * Costly allocations might have made a progress but this doesn't mean
@@ -3733,25 +3755,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		goto retry;
 	}
 
-noretry:
-	/*
-	 * High-order allocations do not necessarily loop after direct reclaim
-	 * and reclaim/compaction depends on compaction being called after
-	 * reclaim so call directly if necessary.
-	 * It can become very expensive to allocate transparent hugepages at
-	 * fault, so use asynchronous memory compaction for THP unless it is
-	 * khugepaged trying to collapse. All other requests should tolerate
-	 * at least light sync migration.
-	 */
-	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
-		migration_mode = MIGRATE_ASYNC;
-	else
-		migration_mode = MIGRATE_SYNC_LIGHT;
-	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
-					    ac, migration_mode,
-					    &compact_result);
-	if (page)
-		goto got_pg;
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
-- 
2.8.2


* [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (3 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 04/13] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-12 13:43   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
                   ` (8 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

Since THP allocations during page faults can be costly, extra decisions are
employed for them to avoid excessive reclaim and compaction, if the initial
compaction doesn't look promising. The detection has never been perfect as
there is no gfp flag specific to THP allocations. At this moment it checks the
whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other
users of such a combination exist, or would mind being treated the same way.
Extra care is also taken to separate allocations from khugepaged, where latency
doesn't matter that much.

It is however possible to distinguish these allocations in a simpler and more
reliable way. The key observation is that after the initial compaction followed
by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY
allocations and costly allocations without __GFP_REPEAT are declared as
failures:

        /* Do not loop if specifically requested */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /*
         * Do not retry costly high order allocations unless they are
         * __GFP_REPEAT
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
                goto nopage;

This means we can further distinguish allocations that are costly order *and*
additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE
allocations do already fall into this category. This will also allow other
costly allocations with similar high-order benefit vs latency considerations to
use this semantic. Furthermore, we can distinguish THP allocations that should
try a bit harder (such as from khugepaged) by removing __GFP_NORETRY, as will
be done in the next patch.
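
In code terms, the THP-specific branch under the initial costly-order
compaction becomes a plain gfp check (condensed sketch of the result, using
only names that appear in the diffs):

        if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
                page = __alloc_pages_direct_compact(..., MIGRATE_ASYNC, ...);
                if (page)
                        goto got_pg;

                /* costly allocation with __GFP_NORETRY, e.g. a THP page fault */
                if (gfp_mask & __GFP_NORETRY) {
                        if (compact_result == COMPACT_DEFERRED)
                                goto nopage;
                        if (compact_result == COMPACT_CONTENDED)
                                goto nopage;
                        /* reclaim/compaction worth trying, but stay async */
                        if (!(current->flags & PF_KTHREAD))
                                migration_mode = MIGRATE_ASYNC;
                }
        }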

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 22 +++++++++-------------
 1 file changed, 9 insertions(+), 13 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88d680b3e7b6..f5d931e0854a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3182,7 +3182,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 	return page;
 }
 
-
 /*
  * Maximum number of compaction retries wit a progress before OOM
  * killer is consider as the only way to move forward.
@@ -3447,11 +3446,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
 }
 
-static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
-{
-	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
-}
-
 /*
  * Maximum number of reclaim retries without any progress before OOM killer
  * is consider as the only way to move forward.
@@ -3610,8 +3604,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 		if (page)
 			goto got_pg;
 
-		/* Checks for THP-specific high-order allocations */
-		if (is_thp_gfp_mask(gfp_mask)) {
+		/*
+		 * Checks for costly allocations with __GFP_NORETRY, which
+		 * includes THP page fault allocations
+		 */
+		if (gfp_mask & __GFP_NORETRY) {
 			/*
 			 * If compaction is deferred for high-order allocations,
 			 * it is because sync compaction recently failed. If
@@ -3631,11 +3628,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				goto nopage;
 
 			/*
-			 * It can become very expensive to allocate transparent
-			 * hugepages at fault, so use asynchronous memory
-			 * compaction for THP unless it is khugepaged trying to
-			 * collapse. All other requests should tolerate at
-			 * least light sync migration.
+			 * Looks like reclaim/compaction is worth trying, but
+			 * sync compaction could be very expensive, so keep
+			 * using async compaction, unless it's khugepaged
+			 * trying to collapse.
 			 */
 			if (!(current->flags & PF_KTHREAD))
 				migration_mode = MIGRATE_ASYNC;
-- 
2.8.2


* [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (4 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-12 16:20   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 07/13] mm, compaction: introduce direct compaction priority Vlastimil Babka
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

After the previous patch, we can distinguish costly allocations that should be
really lightweight, such as THP page faults, with __GFP_NORETRY. This means we
don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can
also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to
try as hard as khugepaged, as the process has indicated that it benefits from
THP's and is willing to pay some initial latency costs.

This is implemented by removing __GFP_NORETRY from GFP_TRANSHUGE and applying
it selectively for current GFP_TRANSHUGE users:

* get_huge_zero_page() - the zero page lifetime should be relatively long and
  it's shared by multiple users, so it's worth spending some effort on it.
  __GFP_NORETRY is not added

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not
  an issue. So if khugepaged "defrag" is enabled (the default), do reclaim
  without __GFP_NORETRY. We can remove the PF_KTHREAD check from page alloc.
  As a side-effect, khugepaged will now no longer check if the initial
  compaction was deferred or contended. This is OK, as khugepaged sleep times
  between collapse attempts are long enough to prevent noticeable disruption,
  so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already does ~__GFP_RECLAIM, so
  removing __GFP_NORETRY has no effect here

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) now
  allocate without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY if
  direct reclaim/compaction is at all allowed (by default it's allowed only
  for VM_HUGEPAGE vma's). See the sketch below.
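
For the page fault path, the net effect when the defrag setting allows direct
reclaim for the vma is roughly (sketch of the effective masks; note that
GFP_TRANSHUGE itself no longer contains __GFP_NORETRY after this patch):

        /* madvise(MADV_HUGEPAGE) vma: try harder, no __GFP_NORETRY */
        gfp_mask = GFP_TRANSHUGE | __GFP_DIRECT_RECLAIM;

        /* other vma's (when direct reclaim is still allowed, e.g. defrag=always) */
        gfp_mask = GFP_TRANSHUGE | __GFP_DIRECT_RECLAIM | __GFP_NORETRY;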

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h | 3 +--
 mm/huge_memory.c    | 8 +++++---
 mm/page_alloc.c     | 6 ++----
 3 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..0cb09714d960 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -256,8 +256,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
-			 ~__GFP_RECLAIM)
+			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index a69e1e144050..30a254a5e780 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,9 +882,10 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 }
 
 /*
- * If THP is set to always then directly reclaim/compact as necessary
- * If set to defer then do no reclaim and defer to khugepaged
+ * If THP defrag is set to always then directly reclaim/compact as necessary
+ * If set to defer then do only background reclaim/compact and defer to khugepaged
  * If set to madvise and the VMA is flagged then directly reclaim/compact
+ * When direct reclaim/compact is allowed, try a bit harder for flagged VMA's
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
@@ -896,7 +897,8 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		reclaim_flags = __GFP_KSWAPD_RECLAIM;
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
+		reclaim_flags = __GFP_DIRECT_RECLAIM |
+					((vma->vm_flags & VM_HUGEPAGE) ? 0 : __GFP_NORETRY);
 
 	return GFP_TRANSHUGE | reclaim_flags;
 }
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f5d931e0854a..1a5ff4525a0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3630,11 +3630,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
-			 * using async compaction, unless it's khugepaged
-			 * trying to collapse.
+			 * using async compaction.
 			 */
-			if (!(current->flags & PF_KTHREAD))
-				migration_mode = MIGRATE_ASYNC;
+			migration_mode = MIGRATE_ASYNC;
 		}
 	}
 
-- 
2.8.2


* [RFC 07/13] mm, compaction: introduce direct compaction priority
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (5 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-13 12:37   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 08/13] mm, compaction: simplify contended compaction handling Vlastimil Babka
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

In the context of direct compaction, for some types of allocations we would
like compaction to either succeed, or to fail only after it has tried as hard
as possible. The current async/sync_light migration modes are insufficient, as there
are heuristics such as caching scanner positions, marking pageblocks as
unsuitable or deferring compaction for a zone. At least the final compaction
attempt should be able to override these heuristics.

To communicate how hard compaction should try, we replace migration mode with
a new enum compact_priority and change the relevant function signatures. In
compact_zone_order() where struct compact_control is constructed, the priority
is mapped to suitable control flags. This patch itself has no functional
change, as the current priority levels are mapped back to the same migration
modes as before. Expanding them will be done next.

Note that the !CONFIG_COMPACTION variant of try_to_compact_pages() is removed,
as the only caller exists under CONFIG_COMPACTION.
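
Note also that a numerically lower value means a higher priority (matching
reclaim priorities), so escalating compaction effort is a simple decrement.
With the two levels defined here, should_compact_retry() in the hunk below
walks from COMPACT_PRIO_ASYNC to COMPACT_PRIO_SYNC_LIGHT, which is the same
async -> sync_light escalation the old migration modes provided (sketch):

        if (compaction_failed(compact_result)) {
                if (*compact_priority > 0) {
                        (*compact_priority)--;  /* e.g. ASYNC -> SYNC_LIGHT */
                        return true;            /* worth retrying harder */
                }
                return false;
        }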
---
 include/linux/compaction.h | 18 +++++++++---------
 mm/compaction.c            | 14 ++++++++------
 mm/page_alloc.c            | 27 +++++++++++++--------------
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 4ba90e74969c..900d181ff1b0 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -1,6 +1,14 @@
 #ifndef _LINUX_COMPACTION_H
 #define _LINUX_COMPACTION_H
 
+// TODO: lower value means higher priority to match reclaim, makes sense?
+enum compact_priority {
+	COMPACT_PRIO_SYNC_LIGHT,
+	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
+	COMPACT_PRIO_ASYNC,
+	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
+};
+
 /* Return values for compact_zone() and try_to_compact_pages() */
 /* When adding new states, please adjust include/trace/events/compaction.h */
 enum compact_result {
@@ -66,7 +74,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order,
 			unsigned int alloc_flags, const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended);
+			enum compact_priority prio, int *contended);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
@@ -151,14 +159,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline enum compact_result try_to_compact_pages(gfp_t gfp_mask,
-			unsigned int order, unsigned int alloc_flags,
-			const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended)
-{
-	return COMPACT_CONTINUE;
-}
-
 static inline void compact_pgdat(pg_data_t *pgdat, int order)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 481004c73c90..abfd71e1f1a3 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1571,7 +1571,7 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum migrate_mode mode, int *contended,
+		gfp_t gfp_mask, enum compact_priority prio, int *contended,
 		unsigned int alloc_flags, int classzone_idx)
 {
 	enum compact_result ret;
@@ -1581,7 +1581,8 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.order = order,
 		.gfp_mask = gfp_mask,
 		.zone = zone,
-		.mode = mode,
+		.mode = (prio == COMPACT_PRIO_ASYNC) ?
+					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
@@ -1614,7 +1615,7 @@ int sysctl_extfrag_threshold = 500;
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			unsigned int alloc_flags, const struct alloc_context *ac,
-			enum migrate_mode mode, int *contended)
+			enum compact_priority prio, int *contended)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
@@ -1629,7 +1630,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	if (!order || !may_enter_fs || !may_perform_io)
 		return COMPACT_SKIPPED;
 
-	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
+	//XXX: FIXME
+	//trace_mm_compaction_try_to_compact_pages(order, gfp_mask, mode);
 
 	/* Compact each zone in the list */
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
@@ -1642,7 +1644,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			continue;
 		}
 
-		status = compact_zone_order(zone, order, gfp_mask, mode,
+		status = compact_zone_order(zone, order, gfp_mask, prio,
 				&zone_contended, alloc_flags,
 				ac_classzone_idx(ac));
 		rc = max(status, rc);
@@ -1676,7 +1678,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			goto break_loop;
 		}
 
-		if (mode != MIGRATE_ASYNC && (status == COMPACT_COMPLETE ||
+		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
 					status == COMPACT_PARTIAL_SKIPPED)) {
 			/*
 			 * We think that allocation won't succeed in this zone
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1a5ff4525a0e..17abc05be972 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3193,7 +3193,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
 static struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
 	int contended_compaction;
@@ -3203,7 +3203,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						mode, &contended_compaction);
+						prio, &contended_compaction);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*compact_result <= COMPACT_INACTIVE)
@@ -3258,7 +3258,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 
 static inline bool
 should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
-		     enum compact_result compact_result, enum migrate_mode *migrate_mode,
+		     enum compact_result compact_result, enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	int max_retries = MAX_COMPACT_RETRIES;
@@ -3269,11 +3269,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	/*
 	 * compaction considers all the zone as desperately out of memory
 	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by weak migration mode.
+	 * failure could be caused by insufficient priority
 	 */
 	if (compaction_failed(compact_result)) {
-		if (*migrate_mode == MIGRATE_ASYNC) {
-			*migrate_mode = MIGRATE_SYNC_LIGHT;
+		if (*compact_priority > 0) {
+			(*compact_priority)--;
 			return true;
 		}
 		return false;
@@ -3307,7 +3307,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 static inline struct page *
 __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		unsigned int alloc_flags, const struct alloc_context *ac,
-		enum migrate_mode mode, enum compact_result *compact_result)
+		enum compact_priority prio, enum compact_result *compact_result)
 {
 	return NULL;
 }
@@ -3315,7 +3315,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 static inline bool
 should_compact_retry(struct alloc_context *ac, unsigned int order, int alloc_flags,
 		     enum compact_result compact_result,
-		     enum migrate_mode *migrate_mode,
+		     enum compact_priority *compact_priority,
 		     int compaction_retries)
 {
 	return false;
@@ -3549,7 +3549,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
-	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
+	enum compact_priority compact_priority = DEF_COMPACT_PRIORITY;
 	enum compact_result compact_result;
 	int compaction_retries = 0;
 	int no_progress_loops = 0;
@@ -3599,7 +3599,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
 		page = __alloc_pages_direct_compact(gfp_mask, order,
 						alloc_flags, ac,
-						MIGRATE_ASYNC,
+						INIT_COMPACT_PRIORITY,
 						&compact_result);
 		if (page)
 			goto got_pg;
@@ -3632,7 +3632,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			 * sync compaction could be very expensive, so keep
 			 * using async compaction.
 			 */
-			migration_mode = MIGRATE_ASYNC;
+			compact_priority = INIT_COMPACT_PRIORITY;
 		}
 	}
 
@@ -3693,8 +3693,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 
 	/* Try direct compaction and then allocating */
 	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
-					migration_mode,
-					&compact_result);
+					compact_priority, &compact_result);
 	if (page)
 		goto got_pg;
 
@@ -3734,7 +3733,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	 */
 	if (did_some_progress > 0 &&
 			should_compact_retry(ac, order, alloc_flags,
-				compact_result, &migration_mode,
+				compact_result, &compact_priority,
 				compaction_retries))
 		goto retry;
 
-- 
2.8.2


* [RFC 08/13] mm, compaction: simplify contended compaction handling
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (6 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 07/13] mm, compaction: introduce direct compaction priority Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-13 13:09   ` Michal Hocko
  2016-05-10  7:35 ` [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

Async compaction detects contention either due to failing trylock on zone->lock
or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction:
khugepaged should not give up due to need_resched()") the code got quite
complicated to distinguish these two up to the __alloc_pages_slowpath() level,
so different decisions could be taken for khugepaged allocations.

After the recent changes, khugepaged allocations don't check for contended
compaction anymore, so we no longer need to distinguish lock and sched
contention, and can simplify the current convoluted code a lot.

However, I believe it's also possible to simplify even more and completely
remove the check for contended compaction after the initial async compaction
for costly orders, which was originally aimed at THP page fault allocations.
There are several reasons why this can be done now:

- with the new defaults, THP page faults no longer do reclaim/compaction at
  all, unless the system admin has overridden the default, or the application
  has indicated via madvise that it can benefit from THPs. In both cases, it
  means that the potential extra latency is expected and worth the benefits.
- even if reclaim/compaction proceeds after this patch where it previously
  wouldn't, the second compaction attempt is still async and will detect the
  contention and back off, if the contention persists
- there are still heuristics like deferred compaction and pageblock skip bits
  in place that prevent excessive THP page fault latencies

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h | 10 +------
 mm/compaction.c            | 72 +++++++++-------------------------------------
 mm/internal.h              |  5 +---
 mm/page_alloc.c            | 28 +-----------------
 4 files changed, 16 insertions(+), 99 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 900d181ff1b0..cd3a59f1601e 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -51,14 +51,6 @@ enum compact_result {
 	COMPACT_PARTIAL,
 };
 
-/* Used to signal whether compaction detected need_sched() or lock contention */
-/* No contention detected */
-#define COMPACT_CONTENDED_NONE	0
-/* Either need_sched() was true or fatal signal pending */
-#define COMPACT_CONTENDED_SCHED	1
-/* Zone lock or lru_lock was contended in async compaction */
-#define COMPACT_CONTENDED_LOCK	2
-
 struct alloc_context; /* in mm/internal.h */
 
 #ifdef CONFIG_COMPACTION
@@ -74,7 +66,7 @@ extern int fragmentation_index(struct zone *zone, unsigned int order);
 extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order,
 			unsigned int alloc_flags, const struct alloc_context *ac,
-			enum compact_priority prio, int *contended);
+			enum compact_priority prio);
 extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
diff --git a/mm/compaction.c b/mm/compaction.c
index abfd71e1f1a3..f649c7bc6de5 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -279,7 +279,7 @@ static bool compact_trylock_irqsave(spinlock_t *lock, unsigned long *flags,
 {
 	if (cc->mode == MIGRATE_ASYNC) {
 		if (!spin_trylock_irqsave(lock, *flags)) {
-			cc->contended = COMPACT_CONTENDED_LOCK;
+			cc->contended = true;
 			return false;
 		}
 	} else {
@@ -313,13 +313,13 @@ static bool compact_unlock_should_abort(spinlock_t *lock,
 	}
 
 	if (fatal_signal_pending(current)) {
-		cc->contended = COMPACT_CONTENDED_SCHED;
+		cc->contended = true;
 		return true;
 	}
 
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = COMPACT_CONTENDED_SCHED;
+			cc->contended = true;
 			return true;
 		}
 		cond_resched();
@@ -342,7 +342,7 @@ static inline bool compact_should_abort(struct compact_control *cc)
 	/* async compaction aborts if contended */
 	if (need_resched()) {
 		if (cc->mode == MIGRATE_ASYNC) {
-			cc->contended = COMPACT_CONTENDED_SCHED;
+			cc->contended = true;
 			return true;
 		}
 
@@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
 				cc->free_pfn, end_pfn, sync, ret);
 
-	if (ret == COMPACT_CONTENDED)
-		ret = COMPACT_PARTIAL;
-
 	return ret;
 }
 
 static enum compact_result compact_zone_order(struct zone *zone, int order,
-		gfp_t gfp_mask, enum compact_priority prio, int *contended,
+		gfp_t gfp_mask, enum compact_priority prio,
 		unsigned int alloc_flags, int classzone_idx)
 {
 	enum compact_result ret;
@@ -1595,7 +1592,6 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 	VM_BUG_ON(!list_empty(&cc.freepages));
 	VM_BUG_ON(!list_empty(&cc.migratepages));
 
-	*contended = cc.contended;
 	return ret;
 }
 
@@ -1608,23 +1604,18 @@ int sysctl_extfrag_threshold = 500;
  * @alloc_flags: The allocation flags of the current allocation
  * @ac: The context of current allocation
  * @mode: The migration mode for async, sync light, or sync migration
- * @contended: Return value that determines if compaction was aborted due to
- *	       need_resched() or lock contention
  *
  * This is the main entry point for direct page compaction.
  */
 enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			unsigned int alloc_flags, const struct alloc_context *ac,
-			enum compact_priority prio, int *contended)
+			enum compact_priority prio)
 {
 	int may_enter_fs = gfp_mask & __GFP_FS;
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
 	enum compact_result rc = COMPACT_SKIPPED;
-	int all_zones_contended = COMPACT_CONTENDED_LOCK; /* init for &= op */
-
-	*contended = COMPACT_CONTENDED_NONE;
 
 	/* Check if the GFP flags allow compaction */
 	if (!order || !may_enter_fs || !may_perform_io)
@@ -1637,7 +1628,6 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		enum compact_result status;
-		int zone_contended;
 
 		if (compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
@@ -1645,14 +1635,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 		}
 
 		status = compact_zone_order(zone, order, gfp_mask, prio,
-				&zone_contended, alloc_flags,
-				ac_classzone_idx(ac));
+					alloc_flags, ac_classzone_idx(ac));
 		rc = max(status, rc);
-		/*
-		 * It takes at least one zone that wasn't lock contended
-		 * to clear all_zones_contended.
-		 */
-		all_zones_contended &= zone_contended;
 
 		/* If a normal allocation would succeed, stop compacting */
 		if (zone_watermark_ok(zone, order, low_wmark_pages(zone),
@@ -1664,59 +1648,29 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 			 * succeeds in this zone.
 			 */
 			compaction_defer_reset(zone, order, false);
-			/*
-			 * It is possible that async compaction aborted due to
-			 * need_resched() and the watermarks were ok thanks to
-			 * somebody else freeing memory. The allocation can
-			 * however still fail so we better signal the
-			 * need_resched() contention anyway (this will not
-			 * prevent the allocation attempt).
-			 */
-			if (zone_contended == COMPACT_CONTENDED_SCHED)
-				*contended = COMPACT_CONTENDED_SCHED;
 
-			goto break_loop;
+			break;
 		}
 
 		if (prio != COMPACT_PRIO_ASYNC && (status == COMPACT_COMPLETE ||
-					status == COMPACT_PARTIAL_SKIPPED)) {
+					status == COMPACT_PARTIAL_SKIPPED))
 			/*
 			 * We think that allocation won't succeed in this zone
 			 * so we defer compaction there. If it ends up
 			 * succeeding after all, it will be reset.
 			 */
 			defer_compaction(zone, order);
-		}
 
 		/*
 		 * We might have stopped compacting due to need_resched() in
 		 * async compaction, or due to a fatal signal detected. In that
-		 * case do not try further zones and signal need_resched()
-		 * contention.
-		 */
-		if ((zone_contended == COMPACT_CONTENDED_SCHED)
-					|| fatal_signal_pending(current)) {
-			*contended = COMPACT_CONTENDED_SCHED;
-			goto break_loop;
-		}
-
-		continue;
-break_loop:
-		/*
-		 * We might not have tried all the zones, so  be conservative
-		 * and assume they are not all lock contended.
+		 * case do not try further zones
 		 */
-		all_zones_contended = 0;
-		break;
+		if ((prio == COMPACT_PRIO_ASYNC && need_resched())
+					|| fatal_signal_pending(current))
+			break;
 	}
 
-	/*
-	 * If at least one zone wasn't deferred or skipped, we report if all
-	 * zones that were tried were lock contended.
-	 */
-	if (rc > COMPACT_INACTIVE && all_zones_contended)
-		*contended = COMPACT_CONTENDED_LOCK;
-
 	return rc;
 }
 
diff --git a/mm/internal.h b/mm/internal.h
index b6ead95a0184..556bc9d0a817 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -184,10 +184,7 @@ struct compact_control {
 	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
 	const int classzone_idx;	/* zone index of a direct compactor */
 	struct zone *zone;
-	int contended;			/* Signal need_sched() or lock
-					 * contention detected during
-					 * compaction
-					 */
+	bool contended;			/* Signal lock or sched contention */
 };
 
 unsigned long
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 17abc05be972..aa9c39a7f40a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3196,14 +3196,13 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 		enum compact_priority prio, enum compact_result *compact_result)
 {
 	struct page *page;
-	int contended_compaction;
 
 	if (!order)
 		return NULL;
 
 	current->flags |= PF_MEMALLOC;
 	*compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac,
-						prio, &contended_compaction);
+									prio);
 	current->flags &= ~PF_MEMALLOC;
 
 	if (*compact_result <= COMPACT_INACTIVE)
@@ -3233,24 +3232,6 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTFAIL);
 
-	/*
-	 * In all zones where compaction was attempted (and not
-	 * deferred or skipped), lock contention has been detected.
-	 * For THP allocation we do not want to disrupt the others
-	 * so we fallback to base pages instead.
-	 */
-	if (contended_compaction == COMPACT_CONTENDED_LOCK)
-		*compact_result = COMPACT_CONTENDED;
-
-	/*
-	 * If compaction was aborted due to need_resched(), we do not
-	 * want to further increase allocation latency, unless it is
-	 * khugepaged trying to collapse.
-	 */
-	if (contended_compaction == COMPACT_CONTENDED_SCHED
-		&& !(current->flags & PF_KTHREAD))
-		*compact_result = COMPACT_CONTENDED;
-
 	cond_resched();
 
 	return NULL;
@@ -3621,13 +3602,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				goto nopage;
 
 			/*
-			 * Compaction is contended so rather back off than cause
-			 * excessive stalls.
-			 */
-			if (compact_result == COMPACT_CONTENDED)
-				goto nopage;
-
-			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
 			 * using async compaction.
-- 
2.8.2


* [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (7 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 08/13] mm, compaction: simplify contended compaction handling Vlastimil Babka
@ 2016-05-10  7:35 ` Vlastimil Babka
  2016-05-13 13:23   ` Michal Hocko
  2016-05-10  7:36 ` [RFC 10/13] mm, compaction: cleanup unused functions Vlastimil Babka
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:35 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

A recent patch has added the whole_zone flag that compaction sets when scanning
starts from the zone boundary, in order to report that the zone has been fully
scanned in one attempt. For allocations that want to try really hard or cannot
fail, we will want to introduce a mode where scanning whole zone is guaranteed
regardless of the cached positions.

This patch reuses the whole_zone flag so that when it is passed to compaction
already set to true, the cached scanner positions are ignored. Employing this flag
during the reclaim/compaction loop will be done in the next patch. This patch
however converts compaction invoked from userspace via procfs to use this flag.
Before this patch, the cached positions were first reset to zone boundaries and
then read back from struct zone, so there was a window where a parallel
compaction could replace the reset values, making the manual compaction less
effective. Using the flag instead of performing the reset is more robust.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/compaction.c | 15 +++++----------
 mm/internal.h   |  2 +-
 2 files changed, 6 insertions(+), 11 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index f649c7bc6de5..1ce6783d3ead 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1442,11 +1442,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
 	 */
 	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
 	cc->free_pfn = zone->compact_cached_free_pfn;
-	if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
+	if (cc->whole_zone || cc->free_pfn < start_pfn ||
+						cc->free_pfn >= end_pfn) {
 		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
 		zone->compact_cached_free_pfn = cc->free_pfn;
 	}
-	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
+	if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
+						cc->migrate_pfn >= end_pfn) {
 		cc->migrate_pfn = start_pfn;
 		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
 		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
@@ -1693,14 +1695,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		INIT_LIST_HEAD(&cc->freepages);
 		INIT_LIST_HEAD(&cc->migratepages);
 
-		/*
-		 * When called via /proc/sys/vm/compact_memory
-		 * this makes sure we compact the whole zone regardless of
-		 * cached scanner positions.
-		 */
-		if (is_via_compact_memory(cc->order))
-			__reset_isolation_suitable(zone);
-
 		if (is_via_compact_memory(cc->order) ||
 				!compaction_deferred(zone, cc->order))
 			compact_zone(zone, cc);
@@ -1736,6 +1730,7 @@ static void compact_node(int nid)
 		.order = -1,
 		.mode = MIGRATE_SYNC,
 		.ignore_skip_hint = true,
+		.whole_zone = true,
 	};
 
 	__compact_pgdat(NODE_DATA(nid), &cc);
diff --git a/mm/internal.h b/mm/internal.h
index 556bc9d0a817..2acdee8ab0e6 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -178,7 +178,7 @@ struct compact_control {
 	enum migrate_mode mode;		/* Async or sync migration mode */
 	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
 	bool direct_compaction;		/* False from kcompactd or /proc/... */
-	bool whole_zone;		/* Whole zone has been scanned */
+	bool whole_zone;		/* Whole zone should/has been scanned */
 	int order;			/* order a direct compactor needs */
 	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
 	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
-- 
2.8.2


* [RFC 10/13] mm, compaction: cleanup unused functions
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (8 preceding siblings ...)
  2016-05-10  7:35 ` [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
@ 2016-05-10  7:36 ` Vlastimil Babka
  2016-05-10  7:36 ` [RFC 11/13] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

Since kswapd compaction moved to kcompactd, compact_pgdat() is not called
anymore, so we remove it. The only caller of __compact_pgdat() is
compact_node(), so we merge them and remove code that was only reachable from
kswapd.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h |  5 ----
 mm/compaction.c            | 60 +++++++++++++---------------------------------
 2 files changed, 17 insertions(+), 48 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index cd3a59f1601e..eeaed24e87a8 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -67,7 +67,6 @@ extern enum compact_result try_to_compact_pages(gfp_t gfp_mask,
 			unsigned int order,
 			unsigned int alloc_flags, const struct alloc_context *ac,
 			enum compact_priority prio);
-extern void compact_pgdat(pg_data_t *pgdat, int order);
 extern void reset_isolation_suitable(pg_data_t *pgdat);
 extern enum compact_result compaction_suitable(struct zone *zone, int order,
 					unsigned int alloc_flags, int classzone_idx);
@@ -151,10 +150,6 @@ extern void kcompactd_stop(int nid);
 extern void wakeup_kcompactd(pg_data_t *pgdat, int order, int classzone_idx);
 
 #else
-static inline void compact_pgdat(pg_data_t *pgdat, int order)
-{
-}
-
 static inline void reset_isolation_suitable(pg_data_t *pgdat)
 {
 }
diff --git a/mm/compaction.c b/mm/compaction.c
index 1ce6783d3ead..7d0935e1a195 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1678,10 +1678,18 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 
 
 /* Compact all zones within a node */
-static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
+static void compact_node(int nid)
 {
+	pg_data_t *pgdat = NODE_DATA(nid);
 	int zoneid;
 	struct zone *zone;
+	struct compact_control cc = {
+		.order = -1,
+		.mode = MIGRATE_SYNC,
+		.ignore_skip_hint = true,
+		.whole_zone = true,
+	};
+
 
 	for (zoneid = 0; zoneid < MAX_NR_ZONES; zoneid++) {
 
@@ -1689,53 +1697,19 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
 		if (!populated_zone(zone))
 			continue;
 
-		cc->nr_freepages = 0;
-		cc->nr_migratepages = 0;
-		cc->zone = zone;
-		INIT_LIST_HEAD(&cc->freepages);
-		INIT_LIST_HEAD(&cc->migratepages);
-
-		if (is_via_compact_memory(cc->order) ||
-				!compaction_deferred(zone, cc->order))
-			compact_zone(zone, cc);
-
-		VM_BUG_ON(!list_empty(&cc->freepages));
-		VM_BUG_ON(!list_empty(&cc->migratepages));
+		cc.nr_freepages = 0;
+		cc.nr_migratepages = 0;
+		cc.zone = zone;
+		INIT_LIST_HEAD(&cc.freepages);
+		INIT_LIST_HEAD(&cc.migratepages);
 
-		if (is_via_compact_memory(cc->order))
-			continue;
+		compact_zone(zone, &cc);
 
-		if (zone_watermark_ok(zone, cc->order,
-				low_wmark_pages(zone), 0, 0))
-			compaction_defer_reset(zone, cc->order, false);
+		VM_BUG_ON(!list_empty(&cc.freepages));
+		VM_BUG_ON(!list_empty(&cc.migratepages));
 	}
 }
 
-void compact_pgdat(pg_data_t *pgdat, int order)
-{
-	struct compact_control cc = {
-		.order = order,
-		.mode = MIGRATE_ASYNC,
-	};
-
-	if (!order)
-		return;
-
-	__compact_pgdat(pgdat, &cc);
-}
-
-static void compact_node(int nid)
-{
-	struct compact_control cc = {
-		.order = -1,
-		.mode = MIGRATE_SYNC,
-		.ignore_skip_hint = true,
-		.whole_zone = true,
-	};
-
-	__compact_pgdat(NODE_DATA(nid), &cc);
-}
-
 /* Compact all nodes in the system */
 static void compact_nodes(void)
 {
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC 11/13] mm, compaction: add the ultimate direct compaction priority
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (9 preceding siblings ...)
  2016-05-10  7:36 ` [RFC 10/13] mm, compaction: cleanup unused functions Vlastimil Babka
@ 2016-05-10  7:36 ` Vlastimil Babka
  2016-05-13 13:38   ` Michal Hocko
  2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

During the reclaim/compaction loop, it's desirable to get a final answer from
unsuccessful compaction so we can either fail the allocation or invoke the OOM
killer. However, heuristics such as deferred compaction or pageblock skip bits
can cause compaction to skip parts of zones or whole zones, and lead to premature
OOMs, failures or excessive reclaim/compaction retries.

To remedy this, we introduce a new direct compaction priority called
COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:

- ignore deferred compaction status for a zone
- ignore pageblock skip hints
- ignore cached scanner positions and scan the whole zone
- use MIGRATE_SYNC migration mode

The new priority should eventually get picked up by should_compact_retry() and
this should improve success rates for costly allocations using __GFP_REPEAT,
such as hugetlbfs allocations, and reduce some corner-case OOMs for non-costly
allocations.
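
For illustration, the intended interaction is roughly the following (a minimal
sketch, not part of the patch; the real driver is should_compact_retry() in the
allocator slow path, and this loop is only a simplification of it, with the
"progress" check merely indicative):

	enum compact_priority prio = DEF_COMPACT_PRIORITY;
	enum compact_result result;

	for (;;) {
		result = try_to_compact_pages(gfp_mask, order, alloc_flags,
						ac, prio);
		if (compaction_made_progress(result))
			break;		/* go back and retry the freelists */
		if (prio == COMPACT_PRIO_SYNC_FULL)
			break;		/* nothing stronger left to try */
		prio--;			/* lower value means higher priority */
	}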

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h |  1 +
 mm/compaction.c            | 15 ++++++++++++---
 2 files changed, 13 insertions(+), 3 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index eeaed24e87a8..af85c620c788 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -3,6 +3,7 @@
 
 // TODO: lower value means higher priority to match reclaim, makes sense?
 enum compact_priority {
+	COMPACT_PRIO_SYNC_FULL,
 	COMPACT_PRIO_SYNC_LIGHT,
 	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
 	COMPACT_PRIO_ASYNC,
diff --git a/mm/compaction.c b/mm/compaction.c
index 7d0935e1a195..9bc475dc4c99 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -1580,12 +1580,20 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.order = order,
 		.gfp_mask = gfp_mask,
 		.zone = zone,
-		.mode = (prio == COMPACT_PRIO_ASYNC) ?
-					MIGRATE_ASYNC :	MIGRATE_SYNC_LIGHT,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
+		.whole_zone = (prio == COMPACT_PRIO_SYNC_FULL),
+		.ignore_skip_hint = (prio == COMPACT_PRIO_SYNC_FULL)
 	};
+
+	if (prio == COMPACT_PRIO_ASYNC)
+		cc.mode = MIGRATE_ASYNC;
+	else if (prio == COMPACT_PRIO_SYNC_LIGHT)
+		cc.mode = MIGRATE_SYNC_LIGHT;
+	else
+		cc.mode = MIGRATE_SYNC;
+
 	INIT_LIST_HEAD(&cc.freepages);
 	INIT_LIST_HEAD(&cc.migratepages);
 
@@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 								ac->nodemask) {
 		enum compact_result status;
 
-		if (compaction_deferred(zone, order)) {
+		if (prio > COMPACT_PRIO_SYNC_FULL
+					&& compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (10 preceding siblings ...)
  2016-05-10  7:36 ` [RFC 11/13] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
@ 2016-05-10  7:36 ` Vlastimil Babka
  2016-05-10 12:55   ` Vlastimil Babka
                     ` (2 more replies)
  2016-05-10  7:36 ` [RFC 13/13] mm, compaction: fix and improve watermark handling Vlastimil Babka
  2016-05-17 20:01 ` [RFC 00/13] make direct compaction more deterministic Michal Hocko
  13 siblings, 3 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

During the reclaim/compaction loop, compaction priority can be increased by the
should_compact_retry() function, but the current code is not optimal for
several reasons:

- priority is only increased when compaction_failed() is true, which means
  that compaction has scanned the whole zone. This may not happen even after
  multiple attempts at the lower priority due to parallel activity, so we
  might needlessly keep struggling at the lower priority.

- should_compact_retry() is only called when should_reclaim_retry() returns
  false. This means that compaction priority cannot get increased as long
  as reclaim makes sufficient progress. In theory, reclaim should stop
  retrying for high-order allocations while the high-order page doesn't
  exist, but due to races, a high-order page that momentarily does exist
  can result in spurious retries.

We can remove these corner cases by making sure that should_compact_retry() is
always called, and increases compaction priority if possible. Examining the
compaction result further is done only after reaching the highest priority.
This is a simple solution and we don't need to worry about reaching the highest
priority "too soon" here - when should_compact_retry() is called, it means that
the system is already struggling and the allocation is supposed to either try
as hard as possible, or it cannot fail at all. There's not much point staying
at lower priorities with heuristics that may result in only partial compaction.

The only exception here is the COMPACT_SKIPPED result, which means that
compaction could not run at all due to failing order-0 watermarks. In that
case, don't increase compaction priority, and instead check whether compaction
could proceed if everything reclaimable was reclaimed. Before this patch, this
was tied to compaction_withdrawn(), but the other results considered there are
in fact only due to low compaction priority, so we can ignore them thanks to
this patch.
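
For reference, the resulting decision order in should_compact_retry() (together
with the compaction_failed() check restored by the fixup posted later in this
thread) is roughly the sketch below; the final retry limit is simplified and
the costly-order handling of it is omitted:

	/* compaction couldn't run at all due to order-0 watermarks */
	if (compact_result == COMPACT_SKIPPED)
		return compaction_zonelist_suitable(ac, order, alloc_flags);

	/* not yet at the highest priority, so try harder first */
	if (*compact_priority > 0) {
		(*compact_priority)--;
		return true;
	}

	/* whole zones scanned at the highest priority and still no page */
	if (compaction_failed(compact_result))
		return false;

	/* progress was made but somebody else allocated the page */
	return compaction_retries <= MAX_COMPACT_RETRIES;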

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
 1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index aa9c39a7f40a..623027fb8121 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3248,28 +3248,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 		return false;
 
 	/*
-	 * compaction considers all the zone as desperately out of memory
-	 * so it doesn't really make much sense to retry except when the
-	 * failure could be caused by insufficient priority
+	 * Compaction backed off due to watermark checks for order-0
+	 * so the regular reclaim has to try harder and reclaim something
+	 * Retry only if it looks like reclaim might have a chance.
 	 */
-	if (compaction_failed(compact_result)) {
-		if (*compact_priority > 0) {
-			(*compact_priority)--;
-			return true;
-		}
-		return false;
-	}
+	if (compact_result == COMPACT_SKIPPED)
+		return compaction_zonelist_suitable(ac, order, alloc_flags);
 
 	/*
-	 * make sure the compaction wasn't deferred or didn't bail out early
-	 * due to locks contention before we declare that we should give up.
-	 * But do not retry if the given zonelist is not suitable for
-	 * compaction.
+	 * Compaction could have withdrawn early or skip some zones or
+	 * pageblocks. We were asked to retry, which means the allocation
+	 * should try really hard, so increase the priority if possible.
 	 */
-	if (compaction_withdrawn(compact_result))
-		return compaction_zonelist_suitable(ac, order, alloc_flags);
+	if (*compact_priority > 0) {
+		(*compact_priority)--;
+		return true;
+	}
 
 	/*
+	 * The remaining possibility is that compaction made progress and
+	 * created a high-order page, but it was allocated by somebody else.
+	 * To prevent thrashing, limit the number of retries in such case.
 	 * !costly requests are much more important than __GFP_REPEAT
 	 * costly ones because they are de facto nofail and invoke OOM
 	 * killer to move on while costly can fail and users are ready
@@ -3527,6 +3526,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 						struct alloc_context *ac)
 {
 	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
+	bool should_retry;
 	struct page *page = NULL;
 	unsigned int alloc_flags;
 	unsigned long did_some_progress;
@@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	else
 		no_progress_loops++;
 
-	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
-				 did_some_progress > 0, no_progress_loops))
-		goto retry;
-
+	should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
+				 did_some_progress > 0, no_progress_loops);
 	/*
 	 * It doesn't make any sense to retry for the compaction if the order-0
 	 * reclaim is not able to make any progress because the current
 	 * implementation of the compaction depends on the sufficient amount
 	 * of free memory (see __compaction_suitable)
 	 */
-	if (did_some_progress > 0 &&
-			should_compact_retry(ac, order, alloc_flags,
+	if (did_some_progress > 0)
+		should_retry |= should_compact_retry(ac, order, alloc_flags,
 				compact_result, &compact_priority,
-				compaction_retries))
+				compaction_retries);
+	if (should_retry)
 		goto retry;
 
+
 	/* Reclaim has failed us, start killing things */
 	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
 	if (page)
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (11 preceding siblings ...)
  2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
@ 2016-05-10  7:36 ` Vlastimil Babka
  2016-05-16  9:25   ` Michal Hocko
  2016-05-17 20:01 ` [RFC 00/13] make direct compaction more deterministic Michal Hocko
  13 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10  7:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Vlastimil Babka

Compaction has been using watermark checks when deciding whether it was
successful, and whether compaction is at all suitable. There are a few problems
with these checks.

- __compact_finished() uses the low watermark in a check that has to pass if
  direct compaction is to finish and the allocation should succeed. This is
  too pessimistic, as the allocation will typically use the min watermark. It
  may happen that during compaction, we drop below the low watermark (due to
  parallel activity), but still form the target high-order page. By checking
  against the low watermark, we might needlessly continue compaction. After this
  patch, the check uses the direct compactor's alloc_flags to determine the
  watermark, which is effectively the min watermark.

- __compaction_suitable() has the same issue in the check whether the allocation
  is already supposed to succeed and we don't need to compact. Fix it the same
  way.

- __compaction_suitable() then checks the low watermark plus a (2 << order) gap
  to decide if there's enough free memory to perform compaction. This check
  uses the direct compactor's alloc_flags, but that's wrong. If alloc_flags
  doesn't include ALLOC_CMA, we might fail the check even though the freepage
  isolation isn't restricted to non-CMA pageblocks. On the other hand,
  alloc_flags may indicate access to memory reserves, making compaction proceed
  and then fail the watermark check during freepage isolation, which doesn't
  pass alloc_flags. The fix here is to use a fixed ALLOC_CMA flag in the
  __compaction_suitable() check.

- __isolate_free_page() uses the low watermark check to decide if a free page
  can be isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.

- The use of low watermark checks in __compaction_suitable() and
  __isolate_free_page() does perhaps make sense for high-order allocations where
  more freepages increase the chance of success, and we can typically fail
  with some order-0 fallback when the system is struggling. But for low-order
  allocations, forming the page should not be that hard. So using the low
  watermark here might just prevent compaction from even trying, and eventually
  lead to the OOM killer even if we are above min watermarks. So after this
  patch, we use the min watermark for non-costly orders in these checks, by
  passing the alloc_flags parameter to split_free_page() and
  __isolate_free_page().

To sum up, after this patch, the kernel should in some situations finish
successful direct compaction sooner, prevent compaction from starting when it's
not needed, proceed with compaction when free memory is in CMA pageblocks, and
for non-costly orders, prevent OOM killing or excessive reclaim when free
memory is between the min and low watermarks.
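
To make the new order-0 gap check concrete (a simplified excerpt of what
__compaction_suitable() looks like after this patch, with example numbers only
for illustration): a non-costly order-3 request now needs min watermark plus
2 << 3 = 16 free pages, while a costly order-9 request needs low watermark plus
2 << 9 = 1024 free pages, with CMA pageblocks counted as free:

	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
				low_wmark_pages(zone) : min_wmark_pages(zone);
	watermark += (2UL << order);	/* +16 for order-3, +1024 for order-9 */
	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
				 ALLOC_CMA, wmark_target))
		return COMPACT_SKIPPED;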

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/mm.h  |  2 +-
 mm/compaction.c     | 28 +++++++++++++++++++++++-----
 mm/internal.h       |  3 ++-
 mm/page_alloc.c     | 13 ++++++++-----
 mm/page_isolation.c |  2 +-
 5 files changed, 35 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db8979ce28a3..ce7248022114 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -518,7 +518,7 @@ void __put_page(struct page *page);
 void put_pages_list(struct list_head *pages);
 
 void split_page(struct page *page, unsigned int order);
-int split_free_page(struct page *page);
+int split_free_page(struct page *page, unsigned int alloc_flags);
 
 /*
  * Compound pages have a destructor function.  Provide a
diff --git a/mm/compaction.c b/mm/compaction.c
index 9bc475dc4c99..207b6c132d6d 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -368,6 +368,16 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 	unsigned long flags = 0;
 	bool locked = false;
 	unsigned long blockpfn = *start_pfn;
+	unsigned int alloc_flags;
+
+	/*
+	 * Determine how split_free_page() will check watermarks, in line with
+	 * compaction_suitable(). Pages in CMA pageblocks should be counted
+	 * as free for this purpose as a migratable page is likely movable
+	 */
+	alloc_flags = (cc->order > PAGE_ALLOC_COSTLY_ORDER) ?
+				ALLOC_WMARK_LOW : ALLOC_WMARK_MIN;
+	alloc_flags |= ALLOC_CMA;
 
 	cursor = pfn_to_page(blockpfn);
 
@@ -440,7 +450,7 @@ static unsigned long isolate_freepages_block(struct compact_control *cc,
 		}
 
 		/* Found a free page, break it into order-0 pages */
-		isolated = split_free_page(page);
+		isolated = split_free_page(page, alloc_flags);
 		total_isolated += isolated;
 		for (i = 0; i < isolated; i++) {
 			list_add(&page->lru, freelist);
@@ -1262,7 +1272,7 @@ static enum compact_result __compact_finished(struct zone *zone, struct compact_
 		return COMPACT_CONTINUE;
 
 	/* Compaction run is not finished if the watermark is not met */
-	watermark = low_wmark_pages(zone);
+	watermark = zone->watermark[cc->alloc_flags & ALLOC_WMARK_MASK];
 
 	if (!zone_watermark_ok(zone, cc->order, watermark, cc->classzone_idx,
 							cc->alloc_flags))
@@ -1327,7 +1337,7 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	if (is_via_compact_memory(order))
 		return COMPACT_CONTINUE;
 
-	watermark = low_wmark_pages(zone);
+	watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
 	/*
 	 * If watermarks for high-order allocation are already met, there
 	 * should be no need for compaction at all.
@@ -1339,11 +1349,19 @@ static enum compact_result __compaction_suitable(struct zone *zone, int order,
 	/*
 	 * Watermarks for order-0 must be met for compaction. Note the 2UL.
 	 * This is because during migration, copies of pages need to be
-	 * allocated and for a short time, the footprint is higher
+	 * allocated and for a short time, the footprint is higher. For
+	 * costly orders, we require low watermark instead of min for
+	 * compaction to proceed to increase its chances. Note that watermark
+	 * and alloc_flags here have to match (or be more pessimistic than)
+	 * the watermark checks done in __isolate_free_page(), and we use the
+	 * direct compactor's classzone_idx to skip over zones where
+	 * lowmem reserves would prevent allocation even if compaction succeeds
 	 */
+	watermark = (order > PAGE_ALLOC_COSTLY_ORDER) ?
+				low_wmark_pages(zone) : min_wmark_pages(zone);
 	watermark += (2UL << order);
 	if (!__zone_watermark_ok(zone, 0, watermark, classzone_idx,
-				 alloc_flags, wmark_target))
+						ALLOC_CMA, wmark_target))
 		return COMPACT_SKIPPED;
 
 	/*
diff --git a/mm/internal.h b/mm/internal.h
index 2acdee8ab0e6..62c1bf61953b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -149,7 +149,8 @@ static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
 	return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
 }
 
-extern int __isolate_free_page(struct page *page, unsigned int order);
+extern int __isolate_free_page(struct page *page, unsigned int order,
+						unsigned int alloc_flags);
 extern void __free_pages_bootmem(struct page *page, unsigned long pfn,
 					unsigned int order);
 extern void prep_compound_page(struct page *page, unsigned int order);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 623027fb8121..2d74eddffcf6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2489,7 +2489,8 @@ void split_page(struct page *page, unsigned int order)
 }
 EXPORT_SYMBOL_GPL(split_page);
 
-int __isolate_free_page(struct page *page, unsigned int order)
+int __isolate_free_page(struct page *page, unsigned int order,
+						unsigned int alloc_flags)
 {
 	unsigned long watermark;
 	struct zone *zone;
@@ -2502,8 +2503,10 @@ int __isolate_free_page(struct page *page, unsigned int order)
 
 	if (!is_migrate_isolate(mt)) {
 		/* Obey watermarks as if the page was being allocated */
-		watermark = low_wmark_pages(zone) + (1 << order);
-		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
+		watermark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
+		/* We know our order page exists, so only check order-0 */
+		watermark += (1UL << order);
+		if (!zone_watermark_ok(zone, 0, watermark, 0, alloc_flags))
 			return 0;
 
 		__mod_zone_freepage_state(zone, -(1UL << order), mt);
@@ -2541,14 +2544,14 @@ int __isolate_free_page(struct page *page, unsigned int order)
  * Note: this is probably too low level an operation for use in drivers.
  * Please consult with lkml before using this in your driver.
  */
-int split_free_page(struct page *page)
+int split_free_page(struct page *page, unsigned int alloc_flags)
 {
 	unsigned int order;
 	int nr_pages;
 
 	order = page_order(page);
 
-	nr_pages = __isolate_free_page(page, order);
+	nr_pages = __isolate_free_page(page, order, alloc_flags);
 	if (!nr_pages)
 		return 0;
 
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index 612122bf6a42..0bcb7a32d84c 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -107,7 +107,7 @@ static void unset_migratetype_isolate(struct page *page, unsigned migratetype)
 
 			if (pfn_valid_within(page_to_pfn(buddy)) &&
 			    !is_migrate_isolate_page(buddy)) {
-				__isolate_free_page(page, order);
+				__isolate_free_page(page, order, 0);
 				kernel_map_pages(page, (1 << order), 1);
 				set_page_refcounted(page);
 				isolated_page = page;
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-10  7:35 ` [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
@ 2016-05-10 11:28   ` Tetsuo Handa
  2016-05-10 12:30     ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Tetsuo Handa @ 2016-05-10 11:28 UTC (permalink / raw)
  To: vbabka, mhocko
  Cc: linux-mm, akpm, iamjoonsoo.kim, riel, rientjes, mgorman, hannes,
	linux-kernel, torvalds

Vlastimil Babka wrote:
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.

Not true. gfp_to_alloc_flags() will include ALLOC_NO_WATERMARKS if the current
thread gets TIF_MEMDIE set after gfp_to_alloc_flags() was called for the first
time. Do you want to make TIF_MEMDIE threads fail their allocations without
using memory reserves?
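
For reference, this is the part of gfp_to_alloc_flags() (before any change)
whose result depends on per-task state and can therefore change between
iterations of the retry loop:

	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
		if (gfp_mask & __GFP_MEMALLOC)
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
			alloc_flags |= ALLOC_NO_WATERMARKS;
		else if (!in_interrupt() &&
				((current->flags & PF_MEMALLOC) ||
				 unlikely(test_thread_flag(TIF_MEMDIE))))
			/* TIF_MEMDIE may only get set after the first call */
			alloc_flags |= ALLOC_NO_WATERMARKS;
	}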

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-10 11:28   ` Tetsuo Handa
@ 2016-05-10 12:30     ` Vlastimil Babka
  2016-05-12 12:41       ` Michal Hocko
  2016-05-31  6:20       ` Joonsoo Kim
  0 siblings, 2 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10 12:30 UTC (permalink / raw)
  To: Tetsuo Handa, mhocko
  Cc: linux-mm, akpm, iamjoonsoo.kim, riel, rientjes, mgorman, hannes,
	linux-kernel, torvalds

On 05/10/2016 01:28 PM, Tetsuo Handa wrote:
> Vlastimil Babka wrote:
>> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
>> so move the initialization above the retry: label. Also make the comment above
>> the initialization more descriptive.
> 
> Not true. gfp_to_alloc_flags() will include ALLOC_NO_WATERMARKS if current
> thread got TIF_MEMDIE after gfp_to_alloc_flags() was called for the first

Oh, right. Stupid global state.

> time. Do you want to make TIF_MEMDIE threads fail their allocations without
> using memory reserves?

No, thanks for catching this. How about the following version? I think
that's an even nicer cleanup, if correct. Note it causes a conflict in
patch 03/13, but it's simple to resolve.

Thanks

----8<----
From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Fri, 29 Apr 2016 11:51:17 +0200
Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath

In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
so move the initialization above the retry: label. Also make the comment above
the initialization more descriptive.

The only exception to alloc_flags being constant is ALLOC_NO_WATERMARKS,
which may change due to TIF_MEMDIE being set on the allocating thread. We can
fix this, and make the code simpler and a bit more efficient at the same time,
by moving the part that determines ALLOC_NO_WATERMARKS from
gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/page_alloc.c | 49 ++++++++++++++++++++++++-------------------------
 1 file changed, 24 insertions(+), 25 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a50184ec6ca0..1b58facf4b5e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3216,8 +3216,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
 	 */
 	count_vm_event(COMPACTSTALL);
 
-	page = get_page_from_freelist(gfp_mask, order,
-					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
 	if (page) {
 		struct zone *zone = page_zone(page);
@@ -3366,8 +3365,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
 		return NULL;
 
 retry:
-	page = get_page_from_freelist(gfp_mask, order,
-					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 
 	/*
 	 * If an allocation failed after direct reclaim, it could be because
@@ -3425,16 +3423,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 	} else if (unlikely(rt_task(current)) && !in_interrupt())
 		alloc_flags |= ALLOC_HARDER;
 
-	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
-		if (gfp_mask & __GFP_MEMALLOC)
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-		else if (!in_interrupt() &&
-				((current->flags & PF_MEMALLOC) ||
-				 unlikely(test_thread_flag(TIF_MEMDIE))))
-			alloc_flags |= ALLOC_NO_WATERMARKS;
-	}
 #ifdef CONFIG_CMA
 	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
 		alloc_flags |= ALLOC_CMA;
@@ -3444,7 +3432,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
 
 bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
 {
-	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
+	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
+		return false;
+
+	if (gfp_mask & __GFP_MEMALLOC)
+		return true;
+	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
+		return true;
+	if (!in_interrupt() &&
+			((current->flags & PF_MEMALLOC) ||
+			 unlikely(test_thread_flag(TIF_MEMDIE))))
+		return true;
+
+	return false;
 }
 
 static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
@@ -3579,25 +3579,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
 		gfp_mask &= ~__GFP_ATOMIC;
 
-retry:
-	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
-		wake_all_kswapds(order, ac);
-
 	/*
-	 * OK, we're below the kswapd watermark and have kicked background
-	 * reclaim. Now things get more complex, so set up alloc_flags according
-	 * to how we want to proceed.
+	 * The fast path uses conservative alloc_flags to succeed only until
+	 * kswapd needs to be woken up, and to avoid the cost of setting up
+	 * alloc_flags precisely. So we do that now.
 	 */
 	alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
+retry:
+	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
+		wake_all_kswapds(order, ac);
+
 	/* This is the last chance, in general, before the goto nopage. */
-	page = get_page_from_freelist(gfp_mask, order,
-				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
+	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
 	if (page)
 		goto got_pg;
 
 	/* Allocate without watermarks if the context allows */
-	if (alloc_flags & ALLOC_NO_WATERMARKS) {
+	if (gfp_pfmemalloc_allowed(gfp_mask)) {
 		/*
 		 * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
 		 * the allocation is high priority and these type of
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
@ 2016-05-10 12:55   ` Vlastimil Babka
  2016-05-13 14:15   ` Michal Hocko
  2016-05-31  6:37   ` Joonsoo Kim
  2 siblings, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-10 12:55 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/10/2016 09:36 AM, Vlastimil Babka wrote:
>   	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by insufficient priority
> +	 * Compaction backed off due to watermark checks for order-0
> +	 * so the regular reclaim has to try harder and reclaim something
> +	 * Retry only if it looks like reclaim might have a chance.
>   	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*compact_priority > 0) {
> -			(*compact_priority)--;
> -			return true;
> -		}
> -		return false;
> -	}

Oops, looks like my editing resulted in the compaction_failed() check being
removed completely, which wasn't intentional and can lead to infinite
loops. This should be added on top.

----8<----
From 59a2b38689aa451f661c964dc9bfb990736ad92d Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Tue, 10 May 2016 14:51:03 +0200
Subject: [PATCH 15/15] fixup! mm, compaction: more reliably increase direct
 compaction priority

---
 mm/page_alloc.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fa49eb4a5919..e8a0d33cfb67 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3268,6 +3268,14 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	}
 
 	/*
+	 * Compaction considers all the zones as unfixably fragmented and we
+	 * are on the highest priority, which means it can't be due to
+	 * heuristics and it doesn't really make much sense to retry.
+	 */
+	if (compaction_failed(compact_result))
+		return false;
+
+	/*
 	 * The remaining possibility is that compaction made progress and
 	 * created a high-order page, but it was allocated by somebody else.
 	 * To prevent thrashing, limit the number of retries in such case.
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode
  2016-05-10  7:35 ` [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
@ 2016-05-11 12:40   ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-11 12:40 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Hugh Dickins

On Tue 10-05-16 09:35:51, Vlastimil Babka wrote:
> From: Hugh Dickins <hughd@google.com>
> 
> At present MIGRATE_SYNC_LIGHT is allowing __isolate_lru_page() to
> isolate a PageWriteback page, which __unmap_and_move() then rejects
> with -EBUSY: of course the writeback might complete in between, but
> that's not what we usually expect, so probably better not to isolate it.

this makes a lot of sense regardless of the rest of the series. I will have
a closer look at the rest tomorrow.

> 
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/compaction.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index c72987603343..481004c73c90 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1146,7 +1146,7 @@ static isolate_migrate_t isolate_migratepages(struct zone *zone,
>  	struct page *page;
>  	const isolate_mode_t isolate_mode =
>  		(sysctl_compact_unevictable_allowed ? ISOLATE_UNEVICTABLE : 0) |
> -		(cc->mode == MIGRATE_ASYNC ? ISOLATE_ASYNC_MIGRATE : 0);
> +		(cc->mode != MIGRATE_SYNC ? ISOLATE_ASYNC_MIGRATE : 0);
>  
>  	/*
>  	 * Start at where we last stopped, or beginning of the zone as
> -- 
> 2.8.2

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-10 12:30     ` Vlastimil Babka
@ 2016-05-12 12:41       ` Michal Hocko
  2016-05-31  6:20       ` Joonsoo Kim
  1 sibling, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-12 12:41 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, linux-mm, akpm, iamjoonsoo.kim, riel, rientjes,
	mgorman, hannes, linux-kernel, torvalds

On Tue 10-05-16 14:30:11, Vlastimil Babka wrote:
[...]
> >From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Fri, 29 Apr 2016 11:51:17 +0200
> Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
> 
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.
> 
> The only exception to alloc_flags being constant is ALLOC_NO_WATERMARKS,
> which may change due to TIF_MEMDIE being set on the allocating thread. We can
> fix this, and make the code simpler and a bit more efficient at the same time,
> by moving the part that determines ALLOC_NO_WATERMARKS from
> gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
> mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
> anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().

I like this _very_ much! gfp_to_alloc_flags was really ugly, doing two
separate things, and it is nice to split them up and make the whole thing
easier to follow. gfp_pfmemalloc_allowed() is a prime example of that.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
 
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 49 ++++++++++++++++++++++++-------------------------
>  1 file changed, 24 insertions(+), 25 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a50184ec6ca0..1b58facf4b5e 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3216,8 +3216,7 @@ __alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	count_vm_event(COMPACTSTALL);
>  
> -	page = get_page_from_freelist(gfp_mask, order,
> -					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>  	if (page) {
>  		struct zone *zone = page_zone(page);
> @@ -3366,8 +3365,7 @@ __alloc_pages_direct_reclaim(gfp_t gfp_mask, unsigned int order,
>  		return NULL;
>  
>  retry:
> -	page = get_page_from_freelist(gfp_mask, order,
> -					alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  
>  	/*
>  	 * If an allocation failed after direct reclaim, it could be because
> @@ -3425,16 +3423,6 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  	} else if (unlikely(rt_task(current)) && !in_interrupt())
>  		alloc_flags |= ALLOC_HARDER;
>  
> -	if (likely(!(gfp_mask & __GFP_NOMEMALLOC))) {
> -		if (gfp_mask & __GFP_MEMALLOC)
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -		else if (!in_interrupt() &&
> -				((current->flags & PF_MEMALLOC) ||
> -				 unlikely(test_thread_flag(TIF_MEMDIE))))
> -			alloc_flags |= ALLOC_NO_WATERMARKS;
> -	}
>  #ifdef CONFIG_CMA
>  	if (gfpflags_to_migratetype(gfp_mask) == MIGRATE_MOVABLE)
>  		alloc_flags |= ALLOC_CMA;
> @@ -3444,7 +3432,19 @@ gfp_to_alloc_flags(gfp_t gfp_mask)
>  
>  bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  {
> -	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
> +	if (unlikely(gfp_mask & __GFP_NOMEMALLOC))
> +		return false;
> +
> +	if (gfp_mask & __GFP_MEMALLOC)
> +		return true;
> +	if (in_serving_softirq() && (current->flags & PF_MEMALLOC))
> +		return true;
> +	if (!in_interrupt() &&
> +			((current->flags & PF_MEMALLOC) ||
> +			 unlikely(test_thread_flag(TIF_MEMDIE))))
> +		return true;
> +
> +	return false;
>  }
>  
>  static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> @@ -3579,25 +3579,24 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
>  		gfp_mask &= ~__GFP_ATOMIC;
>  
> -retry:
> -	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> -		wake_all_kswapds(order, ac);
> -
>  	/*
> -	 * OK, we're below the kswapd watermark and have kicked background
> -	 * reclaim. Now things get more complex, so set up alloc_flags according
> -	 * to how we want to proceed.
> +	 * The fast path uses conservative alloc_flags to succeed only until
> +	 * kswapd needs to be woken up, and to avoid the cost of setting up
> +	 * alloc_flags precisely. So we do that now.
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> +retry:
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		wake_all_kswapds(order, ac);
> +
>  	/* This is the last chance, in general, before the goto nopage. */
> -	page = get_page_from_freelist(gfp_mask, order,
> -				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
> +	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
>  	if (page)
>  		goto got_pg;
>  
>  	/* Allocate without watermarks if the context allows */
> -	if (alloc_flags & ALLOC_NO_WATERMARKS) {
> +	if (gfp_pfmemalloc_allowed(gfp_mask)) {
>  		/*
>  		 * Ignore mempolicies if ALLOC_NO_WATERMARKS on the grounds
>  		 * the allocation is high priority and these type of
> -- 
> 2.8.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath
  2016-05-10  7:35 ` [RFC 03/13] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
@ 2016-05-12 12:48   ` Michal Hocko
  2016-05-31  6:25   ` Joonsoo Kim
  1 sibling, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-12 12:48 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:53, Vlastimil Babka wrote:
> After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
> first tries get_page_from_freelist() with the new alloc_flags, as it may
> succeed e.g. due to using min watermark instead of low watermark. This attempt
> does not have to be retried on each loop, since direct reclaim, direct
> compaction and oom call get_page_from_freelist() themselves.
> 
> This patch therefore moves the initial attempt above the retry label. The
> ALLOC_NO_WATERMARKS attempt is kept under retry label as it's special and
> should be retried after each loop.

Yes, this makes the code both clearer and more logical

> Kswapd wakeups are also done on each retry
> to be safe from potential races resulting in kswapd going to sleep while a
> process (that may not be able to reclaim by itself) is still looping.

I am not sure this is really necessary, but it shouldn't be harmful. The
comment clarifies the duplication, so we are not risking "cleanups to
remove duplicated code", I guess.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 91fbf6f95403..7249949d65ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3586,16 +3586,23 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	 */
>  	alloc_flags = gfp_to_alloc_flags(gfp_mask);
>  
> -retry:
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
>  		wake_all_kswapds(order, ac);
>  
> -	/* This is the last chance, in general, before the goto nopage. */
> +	/*
> +	 * The adjusted alloc_flags might result in immediate success, so try
> +	 * that first
> +	 */
>  	page = get_page_from_freelist(gfp_mask, order,
>  				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);
>  	if (page)
>  		goto got_pg;
>  
> +retry:
> +	/* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
> +	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> +		wake_all_kswapds(order, ac);
> +
>  	/* Allocate without watermarks if the context allows */
>  	if (alloc_flags & ALLOC_NO_WATERMARKS) {
>  		/*
> -- 
> 2.8.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath
  2016-05-10  7:35 ` [RFC 04/13] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
@ 2016-05-12 13:29   ` Michal Hocko
  2016-05-13  8:10     ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-12 13:29 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:
> The retry loop in __alloc_pages_slowpath is supposed to keep trying reclaim
> and compaction (and OOM), until either the allocation succeeds, or returns
> with failure. Success here is more probable when reclaim precedes compaction,
> as certain watermarks have to be met for compaction to even try, and more free
> pages increase the probability of compaction success. On the other hand,
> starting with light async compaction (if the watermarks allow it), can be
> more efficient, especially for smaller orders, if there's enough free memory
> which is just fragmented.
> 
> Thus, the current code starts with compaction before reclaim, and to make sure
> that the last reclaim is always followed by a final compaction, there's another
> direct compaction call at the end of the loop. This makes the code hard to
> follow and adds some duplicated handling of migration_mode decisions. It's also
> somewhat inefficient that even if reclaim or compaction decides not to retry,
> the final compaction is still attempted. Some gfp flags combination also
> shortcut these retry decisions by "goto noretry;", making it even harder to
> follow.

I completely agree. It was a head scratcher to properly handle all the
potential paths when I was reorganizing the code for the oom detection
rework.

> This patch attempts to restructure the code with only minimal functional
> changes. The call to the first compaction and THP-specific checks are now
> placed above the retry loop, and the "noretry" direct compaction is removed.
> 
> The initial compaction is additionally restricted only to costly orders, as we
> can expect smaller orders to be held back by watermarks, and only larger orders
> to suffer primarily from fragmentation. This better matches the checks in
> reclaim's shrink_zones().
> 
> There are two other smaller functional changes. One is that the upgrade from
> async migration to light sync migration will always occur after the initial
> compaction.

I do not think this belongs in the patch. There are two reasons. First,
we do not need to use the potentially more expensive sync mode when async is
able to make some progress, and second, with the currently
fragile compaction implementation this might reintroduce the premature
OOM for order-2 requests reported by Hugh. Please see
http://lkml.kernel.org/r/alpine.LSU.2.11.1604141114290.1086@eggly.anvils

Your later patch (which I haven't reviewed yet) then changes this
considerably, but I think it would be safer not to touch the migration
mode in this - mostly cleanup - patch.

> This is how it has been until recent patch "mm, oom: protect
> !costly allocations some more", which introduced upgrading the mode based on
> COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
> made it even more special. It's better to return to the simpler handling for
> now, as migration modes will be further modified later in the series.
> 
> The second change is that once both reclaim and compaction declare it's not
> worth to retry the reclaim/compact loop, there is no final compaction attempt.
> As argued above, this is intentional. If that final compaction were to succeed,
> it would be due to a wrong retry decision, or simply a race with somebody else
> freeing memory for us.
> 
> The main outcome of this patch should be simpler code. Logically, the initial
> compaction without reclaim is the exceptional case to the reclaim/compaction
> scheme, but prior to the patch, it was the last loop iteration that was
> exceptional. Now the code matches the logic better. The change also enable the
> following patches.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Other than the above, I like this patch.
Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 107 +++++++++++++++++++++++++++++---------------------------
>  1 file changed, 55 insertions(+), 52 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 7249949d65ca..88d680b3e7b6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3555,7 +3555,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	struct page *page = NULL;
>  	unsigned int alloc_flags;
>  	unsigned long did_some_progress;
> -	enum migrate_mode migration_mode = MIGRATE_ASYNC;
> +	enum migrate_mode migration_mode = MIGRATE_SYNC_LIGHT;
>  	enum compact_result compact_result;
>  	int compaction_retries = 0;
>  	int no_progress_loops = 0;
> @@ -3598,6 +3598,50 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (page)
>  		goto got_pg;
>  
> +	/*
> +	 * For costly allocations, try direct compaction first, as it's likely
> +	 * that we have enough base pages and don't need to reclaim.
> +	 */
> +	if (can_direct_reclaim && order > PAGE_ALLOC_COSTLY_ORDER) {
> +		page = __alloc_pages_direct_compact(gfp_mask, order,
> +						alloc_flags, ac,
> +						MIGRATE_ASYNC,
> +						&compact_result);
> +		if (page)
> +			goto got_pg;
> +
> +		/* Checks for THP-specific high-order allocations */
> +		if (is_thp_gfp_mask(gfp_mask)) {
> +			/*
> +			 * If compaction is deferred for high-order allocations,
> +			 * it is because sync compaction recently failed. If
> +			 * this is the case and the caller requested a THP
> +			 * allocation, we do not want to heavily disrupt the
> +			 * system, so we fail the allocation instead of entering
> +			 * direct reclaim.
> +			 */
> +			if (compact_result == COMPACT_DEFERRED)
> +				goto nopage;
> +
> +			/*
> +			 * Compaction is contended so rather back off than cause
> +			 * excessive stalls.
> +			 */
> +			if (compact_result == COMPACT_CONTENDED)
> +				goto nopage;
> +
> +			/*
> +			 * It can become very expensive to allocate transparent
> +			 * hugepages at fault, so use asynchronous memory
> +			 * compaction for THP unless it is khugepaged trying to
> +			 * collapse. All other requests should tolerate at
> +			 * least light sync migration.
> +			 */
> +			if (!(current->flags & PF_KTHREAD))
> +				migration_mode = MIGRATE_ASYNC;
> +		}
> +	}
> +
>  retry:
>  	/* Ensure kswapd doesn't accidentaly go to sleep as long as we loop */
>  	if (gfp_mask & __GFP_KSWAPD_RECLAIM)
> @@ -3646,55 +3690,33 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
>  		goto nopage;
>  
> -	/*
> -	 * Try direct compaction. The first pass is asynchronous. Subsequent
> -	 * attempts after direct reclaim are synchronous
> -	 */
> +
> +	/* Try direct reclaim and then allocating */
> +	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> +							&did_some_progress);
> +	if (page)
> +		goto got_pg;
> +
> +	/* Try direct compaction and then allocating */
>  	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
>  					migration_mode,
>  					&compact_result);
>  	if (page)
>  		goto got_pg;
>  
> -	/* Checks for THP-specific high-order allocations */
> -	if (is_thp_gfp_mask(gfp_mask)) {
> -		/*
> -		 * If compaction is deferred for high-order allocations, it is
> -		 * because sync compaction recently failed. If this is the case
> -		 * and the caller requested a THP allocation, we do not want
> -		 * to heavily disrupt the system, so we fail the allocation
> -		 * instead of entering direct reclaim.
> -		 */
> -		if (compact_result == COMPACT_DEFERRED)
> -			goto nopage;
> -
> -		/*
> -		 * Compaction is contended so rather back off than cause
> -		 * excessive stalls.
> -		 */
> -		if(compact_result == COMPACT_CONTENDED)
> -			goto nopage;
> -	}
> -
>  	if (order && compaction_made_progress(compact_result))
>  		compaction_retries++;
>  
> -	/* Try direct reclaim and then allocating */
> -	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
> -							&did_some_progress);
> -	if (page)
> -		goto got_pg;
> -
>  	/* Do not loop if specifically requested */
>  	if (gfp_mask & __GFP_NORETRY)
> -		goto noretry;
> +		goto nopage;
>  
>  	/*
>  	 * Do not retry costly high order allocations unless they are
>  	 * __GFP_REPEAT
>  	 */
>  	if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
> -		goto noretry;
> +		goto nopage;
>  
>  	/*
>  	 * Costly allocations might have made a progress but this doesn't mean
> @@ -3733,25 +3755,6 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		goto retry;
>  	}
>  
> -noretry:
> -	/*
> -	 * High-order allocations do not necessarily loop after direct reclaim
> -	 * and reclaim/compaction depends on compaction being called after
> -	 * reclaim so call directly if necessary.
> -	 * It can become very expensive to allocate transparent hugepages at
> -	 * fault, so use asynchronous memory compaction for THP unless it is
> -	 * khugepaged trying to collapse. All other requests should tolerate
> -	 * at least light sync migration.
> -	 */
> -	if (is_thp_gfp_mask(gfp_mask) && !(current->flags & PF_KTHREAD))
> -		migration_mode = MIGRATE_ASYNC;
> -	else
> -		migration_mode = MIGRATE_SYNC_LIGHT;
> -	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
> -					    ac, migration_mode,
> -					    &compact_result);
> -	if (page)
> -		goto got_pg;
>  nopage:
>  	warn_alloc_failed(gfp_mask, order, NULL);
>  got_pg:
> -- 
> 2.8.2
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic
  2016-05-10  7:35 ` [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
@ 2016-05-12 13:43   ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-12 13:43 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:55, Vlastimil Babka wrote:
> Since THP allocations during page faults can be costly, extra decisions are
> employed for them to avoid excessive reclaim and compaction, if the initial
> compaction doesn't look promising. The detection has never been perfect as
> there is no gfp flag specific to THP allocations. At this moment it checks the
> whole combination of flags that makes up GFP_TRANSHUGE, and hopes that no other
> users of such combination exist, or would mind being treated the same way.
> Extra care is also taken to separate allocations from khugepaged, where latency
> doesn't matter that much.
> 
> It is however possible to distinguish these allocations in a simpler and more
> reliable way. The key observation is that after the initial compaction followed
> by the first iteration of "standard" reclaim/compaction, both __GFP_NORETRY
> allocations and costly allocations without __GFP_REPEAT are declared as
> failures:
> 
>         /* Do not loop if specifically requested */
>         if (gfp_mask & __GFP_NORETRY)
>                 goto nopage;
> 
>         /*
>          * Do not retry costly high order allocations unless they are
>          * __GFP_REPEAT
>          */
>         if (order > PAGE_ALLOC_COSTLY_ORDER && !(gfp_mask & __GFP_REPEAT))
>                 goto nopage;
> 
> This means we can further distinguish allocations that are costly order *and*
> additionally include the __GFP_NORETRY flag. As it happens, GFP_TRANSHUGE
> allocations do already fall into this category. This will also allow other
> costly allocations with similar high-order benefit vs latency considerations to
> use this semantic. Furthermore, we can distinguish THP allocations that should
> try a bit harder (such as from khugepageed) by removing __GFP_NORETRY, as will
> be done in the next patch.

Yes, using __GFP_NORETRY makes perfect sense. It is the weakest mode for a
costly allocation that still includes both compaction and reclaim. I am
happy to see is_thp_gfp_mask going away.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/page_alloc.c | 22 +++++++++-------------
>  1 file changed, 9 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 88d680b3e7b6..f5d931e0854a 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3182,7 +3182,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
>  	return page;
>  }
>  
> -
>  /*
>   * Maximum number of compaction retries wit a progress before OOM
>   * killer is consider as the only way to move forward.
> @@ -3447,11 +3446,6 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
>  	return !!(gfp_to_alloc_flags(gfp_mask) & ALLOC_NO_WATERMARKS);
>  }
>  
> -static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
> -{
> -	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
> -}
> -
>  /*
>   * Maximum number of reclaim retries without any progress before OOM killer
>   * is consider as the only way to move forward.
> @@ -3610,8 +3604,11 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  		if (page)
>  			goto got_pg;
>  
> -		/* Checks for THP-specific high-order allocations */
> -		if (is_thp_gfp_mask(gfp_mask)) {
> +		/*
> +		 * Checks for costly allocations with __GFP_NORETRY, which
> +		 * includes THP page fault allocations
> +		 */
> +		if (gfp_mask & __GFP_NORETRY) {
>  			/*
>  			 * If compaction is deferred for high-order allocations,
>  			 * it is because sync compaction recently failed. If
> @@ -3631,11 +3628,10 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  				goto nopage;
>  
>  			/*
> -			 * It can become very expensive to allocate transparent
> -			 * hugepages at fault, so use asynchronous memory
> -			 * compaction for THP unless it is khugepaged trying to
> -			 * collapse. All other requests should tolerate at
> -			 * least light sync migration.
> +			 * Looks like reclaim/compaction is worth trying, but
> +			 * sync compaction could be very expensive, so keep
> +			 * using async compaction, unless it's khugepaged
> +			 * trying to collapse.
>  			 */
>  			if (!(current->flags & PF_KTHREAD))
>  				migration_mode = MIGRATE_ASYNC;
> -- 
> 2.8.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-10  7:35 ` [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
@ 2016-05-12 16:20   ` Michal Hocko
  2016-05-13  8:23     ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-12 16:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
[...]
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 570383a41853..0cb09714d960 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -256,8 +256,7 @@ struct vm_area_struct;
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
>  #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> -			 ~__GFP_RECLAIM)
> +			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)

I am not sure this is the right thing to do. I think we should keep
__GFP_NORETRY and clear it where we want stronger semantics. It is
just too subtle to trust that all callsites are doing the right thing.
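
For illustration only, a call site that wants the stronger semantic could
then do something like this (hypothetical sketch, not the posted patch;
'node' is just the preferred NUMA node in the example):

	/*
	 * Hypothetical khugepaged-style call site: keep __GFP_NORETRY in the
	 * GFP_TRANSHUGE default and strip it only here, where we want to try
	 * harder.  The reclaim flag has to be added back explicitly because
	 * GFP_TRANSHUGE masks out __GFP_RECLAIM.
	 */
	gfp_t gfp = (GFP_TRANSHUGE | __GFP_DIRECT_RECLAIM) & ~__GFP_NORETRY;
	struct page *hpage = alloc_pages_node(node, gfp, HPAGE_PMD_ORDER);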
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath
  2016-05-12 13:29   ` Michal Hocko
@ 2016-05-13  8:10     ` Vlastimil Babka
  2016-05-13  8:31       ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-13  8:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Hugh Dickins

On 05/12/2016 03:29 PM, Michal Hocko wrote:
> On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:
>> This patch attempts to restructure the code with only minimal functional
>> changes. The call to the first compaction and THP-specific checks are now
>> placed above the retry loop, and the "noretry" direct compaction is removed.
>>
>> The initial compaction is additionally restricted only to costly orders, as we
>> can expect smaller orders to be held back by watermarks, and only larger orders
>> to suffer primarily from fragmentation. This better matches the checks in
>> reclaim's shrink_zones().
>>
>> There are two other smaller functional changes. One is that the upgrade from
>> async migration to light sync migration will always occur after the initial
>> compaction.
>
> I do not think this belongs to the patch. There are two reasons. First
> we do not need to do potentially more expensive sync mode when async is
> able to make some progress and the second

My concern was that __GFP_NORETRY non-costly allocations wouldn't 
otherwise get a MIGRATE_SYNC_LIGHT pass at all. Previously they would 
get it in the noretry: label. So do you think it's a corner case not 
worth caring about? Alternatively we could also remove the 'restriction 
of initial async compaction to costly orders' from this patch and apply 
it separately later. That would also result in the flip to sync_light 
after the initial async attempt for these allocations.

> is that with the currently
> fragile compaction implementation this might reintroduce the premature
> OOM for order-2 requests reported by Hugh. Please see
> http://lkml.kernel.org/r/alpine.LSU.2.11.1604141114290.1086@eggly.anvils

Hmm IIRC that involved some wrong conflict resolution in mmotm? I don't
remember exactly what the code looked like, but wasn't the problem
that the initial compaction was async, then the left-over hunk changed 
migration_mode to sync_light, and then should_compact_retry() thought 
"oh we already failed sync_light, return false" when in fact the 
sync_light compaction never happened? Otherwise I don't see how 
switching to sync_light "too early" could lead to premature OOMs.

> Your later patch (which I haven't reviewed yet) is then changing this
> considerably

Yes, my other concern with should_compact_retry() after your "mm, oom: 
protect !costly allocations some more" is that relying on 
compaction_failed() to upgrade the migration mode is unreliable. Async 
compaction can easily keep returning as contended, so might never see 
the COMPACT_COMPLETE result, if it's e.g. limited to nodes without a 
really small zone such as ZONE_DMA.

> but I think it would be safer to not touch the migration
> mode in this - mostly cleanup - patch.
>
>> This is how it has been until recent patch "mm, oom: protect
>> !costly allocations some more", which introduced upgrading the mode based on
>> COMPACT_COMPLETE result, but kept the final compaction always upgraded, which
>> made it even more special. It's better to return to the simpler handling for
>> now, as migration modes will be further modified later in the series.
>>
>> The second change is that once both reclaim and compaction declare it's not
>> worth to retry the reclaim/compact loop, there is no final compaction attempt.
>> As argued above, this is intentional. If that final compaction were to succeed,
>> it would be due to a wrong retry decision, or simply a race with somebody else
>> freeing memory for us.
>>
>> The main outcome of this patch should be simpler code. Logically, the initial
>> compaction without reclaim is the exceptional case to the reclaim/compaction
>> scheme, but prior to the patch, it was the last loop iteration that was
>> exceptional. Now the code matches the logic better. The change also enable the
>> following patches.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Other than the above thing I like this patch.
> Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-12 16:20   ` Michal Hocko
@ 2016-05-13  8:23     ` Vlastimil Babka
  2016-05-13 12:05       ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-13  8:23 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/12/2016 06:20 PM, Michal Hocko wrote:
> On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
> [...]
>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>> index 570383a41853..0cb09714d960 100644
>> --- a/include/linux/gfp.h
>> +++ b/include/linux/gfp.h
>> @@ -256,8 +256,7 @@ struct vm_area_struct;
>>   #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>>   #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
>>   #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>> -			 ~__GFP_RECLAIM)
>> +			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>
> I am not sure this is the right thing to do. I think we should keep
> __GFP_NORETRY and clear it where we want stronger semantics. It is
> just too subtle to trust that all callsites are doing the right thing.

That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you 
think it's worth it, I can turn the default around, OK.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 04/13] mm, page_alloc: restructure direct compaction handling in slowpath
  2016-05-13  8:10     ` Vlastimil Babka
@ 2016-05-13  8:31       ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-13  8:31 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds, Hugh Dickins

On Fri 13-05-16 10:10:50, Vlastimil Babka wrote:
> On 05/12/2016 03:29 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:35:54, Vlastimil Babka wrote:
> > > This patch attempts to restructure the code with only minimal functional
> > > changes. The call to the first compaction and THP-specific checks are now
> > > placed above the retry loop, and the "noretry" direct compaction is removed.
> > > 
> > > The initial compaction is additionally restricted only to costly orders, as we
> > > can expect smaller orders to be held back by watermarks, and only larger orders
> > > to suffer primarily from fragmentation. This better matches the checks in
> > > reclaim's shrink_zones().
> > > 
> > > There are two other smaller functional changes. One is that the upgrade from
> > > async migration to light sync migration will always occur after the initial
> > > compaction.
> > 
> > I do not think this belongs to the patch. There are two reasons. First
> > we do not need to do potentially more expensive sync mode when async is
> > able to make some progress and the second
> 
> My concern was that __GFP_NORETRY non-costly allocations wouldn't otherwise
> get a MIGRATE_SYNC_LIGHT pass at all. Previously they would get it in the
> noretry: label.

OK, I hadn't considered this. So scratch that, then.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-13  8:23     ` Vlastimil Babka
@ 2016-05-13 12:05       ` Michal Hocko
  2016-05-18 11:59         ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 12:05 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Fri 13-05-16 10:23:31, Vlastimil Babka wrote:
> On 05/12/2016 06:20 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
> > [...]
> > > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > > index 570383a41853..0cb09714d960 100644
> > > --- a/include/linux/gfp.h
> > > +++ b/include/linux/gfp.h
> > > @@ -256,8 +256,7 @@ struct vm_area_struct;
> > >   #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
> > >   #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> > >   #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> > > -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> > > -			 ~__GFP_RECLAIM)
> > > +			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
> > 
> > I am not sure this is the right thing to do. I think we should keep
> > __GFP_NORETRY and clear it where we want stronger semantics. It is
> > just too subtle to trust that all callsites are doing the right thing.
> 
> That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you
> think it's worth it, I can turn the default around, OK.

Hmm, on the other hand it is true that GFP_TRANSHUGE clears both
reclaim flags by default and the callers then override that. This is just
too ugly. Can we make GFP_TRANSHUGE only define the flags we care about,
and then tweak those that should go away at the callsites that matter,
now that we do not rely on is_thp_gfp_mask?
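
Something along these lines, purely as an illustration (the names and the
exact flag sets are made up for the example, not the final form):

/* illustrative only: a base mask with no reclaim/retry modifiers at all */
#define GFP_TRANSHUGE_BASE	(GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
				 __GFP_NOMEMALLOC | __GFP_NOWARN)

/* page fault path: may reclaim, but bails out early */
#define GFP_TRANSHUGE_FAULT	(GFP_TRANSHUGE_BASE | __GFP_DIRECT_RECLAIM | \
				 __GFP_NORETRY)

/* khugepaged: latency matters less, so no __GFP_NORETRY */
#define GFP_TRANSHUGE_COLLAPSE	(GFP_TRANSHUGE_BASE | __GFP_DIRECT_RECLAIM)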

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 07/13] mm, compaction: introduce direct compaction priority
  2016-05-10  7:35 ` [RFC 07/13] mm, compaction: introduce direct compaction priority Vlastimil Babka
@ 2016-05-13 12:37   ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 12:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:57, Vlastimil Babka wrote:
> In the context of direct compaction, for some types of allocations we would
> like the compaction to either succeed or definitely fail while trying as hard
> as possible. Current async/sync_light migration mode is insufficient, as there
> are heuristics such as caching scanner positions, marking pageblocks as
> unsuitable or deferring compaction for a zone. At least the final compaction
> attempt should be able to override these heuristics.
> 
> To communicate how hard compaction should try, we replace migration mode with
> a new enum compact_priority and change the relevant function signatures. In
> compact_zone_order() where struct compact_control is constructed, the priority
> is mapped to suitable control flags. This patch itself has no functional
> change, as the current priority levels are mapped back to the same migration
> modes as before. Expanding them will be done next.
> 
> Note that !CONFIG_COMPACTION variant of try_to_compact_pages() is removed, as
> the only caller exists under CONFIG_COMPACTION.

Your s-o-b is missing

Anyway I like the idea. The migration_mode felt really weird. It exposes
an internal detail of the compaction code which should have no business
in the allocator path.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/compaction.h | 18 +++++++++---------
>  mm/compaction.c            | 14 ++++++++------
>  mm/page_alloc.c            | 27 +++++++++++++--------------
>  3 files changed, 30 insertions(+), 29 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 4ba90e74969c..900d181ff1b0 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -1,6 +1,14 @@
>  #ifndef _LINUX_COMPACTION_H
>  #define _LINUX_COMPACTION_H
>  
> +// TODO: lower value means higher priority to match reclaim, makes sense?

Yes this makes sense to me.

> +enum compact_priority {

enums might be tricky but I guess it should work ok here. I would just
add

	COMPACT_MIN_PRIO,
> +	COMPACT_PRIO_SYNC_LIGHT = COMPACT_MIN_PRIO,
> +	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
> +	COMPACT_PRIO_ASYNC,
> +	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
> +};
> +

to make an implementation independent lowest priority.

[...]

> @@ -3269,11 +3269,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	/*
>  	 * compaction considers all the zone as desperately out of memory
>  	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by weak migration mode.
> +	 * failure could be caused by insufficient priority
>  	 */
>  	if (compaction_failed(compact_result)) {
> -		if (*migrate_mode == MIGRATE_ASYNC) {
> -			*migrate_mode = MIGRATE_SYNC_LIGHT;
> +		if (*compact_priority > 0) {

		if (*compact_priority > COMPACT_MIN_PRIO)

> +			(*compact_priority)--;
>  			return true;
>  		}
>  		return false;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 08/13] mm, compaction: simplify contended compaction handling
  2016-05-10  7:35 ` [RFC 08/13] mm, compaction: simplify contended compaction handling Vlastimil Babka
@ 2016-05-13 13:09   ` Michal Hocko
  2016-05-16  7:10     ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 13:09 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:58, Vlastimil Babka wrote:
> Async compaction detects contention either due to failing trylock on zone->lock
> or lru_lock, or by need_resched(). Since 1f9efdef4f3f ("mm, compaction:
> khugepaged should not give up due to need_resched()") the code got quite
> complicated to distinguish these two up to the __alloc_pages_slowpath() level,
> so different decisions could be taken for khugepaged allocations.
> 
> After the recent changes, khugepaged allocations don't check for contended
> compaction anymore, so we again don't need to distinguish lock and sched
> contention, and simplify the current convoluted code a lot.
> 
> However, I believe it's also possible to simplify even more and completely
> remove the check for contended compaction after the initial async compaction
> for costly orders, which was originally aimed at THP page fault allocations.
> There are several reasons why this can be done now:
> 
> - with the new defaults, THP page faults no longer do reclaim/compaction at
>   all, unless the system admin has overriden the default, or application has
>   indicated via madvise that it can benefit from THP's. In both cases, it
>   means that the potential extra latency is expected and worth the benefits.

Yes, this sounds reasonable to me, especially when we consider the amount
of code bloat this is causing.

> - even if reclaim/compaction proceeds after this patch where it previously
>   wouldn't, the second compaction attempt is still async and will detect the
>   contention and back off, if the contention persists

MIGRATE_ASYNC still backs off after this patch so I would be surprised to
see more latency issues from this change.

> - there are still heuristics like deferred compaction and pageblock skip bits
>   in place that prevent excessive THP page fault latencies
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I hope I haven't missed anything, because compaction is full of
subtle traps, but the changes seem OK to me.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/compaction.h | 10 +------
>  mm/compaction.c            | 72 +++++++++-------------------------------------
>  mm/internal.h              |  5 +---
>  mm/page_alloc.c            | 28 +-----------------
>  4 files changed, 16 insertions(+), 99 deletions(-)

This is a really nice cleanup, considering that it doesn't introduce big
behavior changes, which is my understanding from the code.

[...]
> @@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>  				cc->free_pfn, end_pfn, sync, ret);
>  
> -	if (ret == COMPACT_CONTENDED)
> -		ret = COMPACT_PARTIAL;
> -
>  	return ret;
>  }

This took me a while to grasp, but then I realized this is correct,
because we shouldn't pretend there was progress when in fact there was
none, especially when __alloc_pages_direct_compact basically replaced this
"fake" COMPACT_PARTIAL with COMPACT_CONTENDED anyway.

[...]
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions
  2016-05-10  7:35 ` [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
@ 2016-05-13 13:23   ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 13:23 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:35:59, Vlastimil Babka wrote:
> A recent patch has added whole_zone flag that compaction sets when scanning
> starts from the zone boundary, in order to report that zone has been fully
> scanned in one attempt. For allocations that want to try really hard or cannot
> fail, we will want to introduce a mode where scanning whole zone is guaranteed
> regardless of the cached positions.
> 
> This patch reuses the whole_zone flag in a way that if it's already passed true
> to compaction, the cached scanner positions are ignored. Employing this flag
> during reclaim/compaction loop will be done in the next patch. This patch
> however converts compaction invoked from userspace via procfs to use this flag.
> Before this patch, the cached positions were first reset to zone boundaries and
> then read back from struct zone, so there was a window where a parallel
> compaction could replace the reset values, making the manual compaction less
> effective. Using the flag instead of performing reset is more robust.

This makes perfect sense, because a user-visible API should have
deterministic behavior. Playing scanner-position games that might race
with system-triggered compaction sounds quite unpredictable.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  mm/compaction.c | 15 +++++----------
>  mm/internal.h   |  2 +-
>  2 files changed, 6 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index f649c7bc6de5..1ce6783d3ead 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -1442,11 +1442,13 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>  	 */
>  	cc->migrate_pfn = zone->compact_cached_migrate_pfn[sync];
>  	cc->free_pfn = zone->compact_cached_free_pfn;
> -	if (cc->free_pfn < start_pfn || cc->free_pfn >= end_pfn) {
> +	if (cc->whole_zone || cc->free_pfn < start_pfn ||
> +						cc->free_pfn >= end_pfn) {
>  		cc->free_pfn = pageblock_start_pfn(end_pfn - 1);
>  		zone->compact_cached_free_pfn = cc->free_pfn;
>  	}
> -	if (cc->migrate_pfn < start_pfn || cc->migrate_pfn >= end_pfn) {
> +	if (cc->whole_zone || cc->migrate_pfn < start_pfn ||
> +						cc->migrate_pfn >= end_pfn) {
>  		cc->migrate_pfn = start_pfn;
>  		zone->compact_cached_migrate_pfn[0] = cc->migrate_pfn;
>  		zone->compact_cached_migrate_pfn[1] = cc->migrate_pfn;
> @@ -1693,14 +1695,6 @@ static void __compact_pgdat(pg_data_t *pgdat, struct compact_control *cc)
>  		INIT_LIST_HEAD(&cc->freepages);
>  		INIT_LIST_HEAD(&cc->migratepages);
>  
> -		/*
> -		 * When called via /proc/sys/vm/compact_memory
> -		 * this makes sure we compact the whole zone regardless of
> -		 * cached scanner positions.
> -		 */
> -		if (is_via_compact_memory(cc->order))
> -			__reset_isolation_suitable(zone);
> -
>  		if (is_via_compact_memory(cc->order) ||
>  				!compaction_deferred(zone, cc->order))
>  			compact_zone(zone, cc);
> @@ -1736,6 +1730,7 @@ static void compact_node(int nid)
>  		.order = -1,
>  		.mode = MIGRATE_SYNC,
>  		.ignore_skip_hint = true,
> +		.whole_zone = true,
>  	};
>  
>  	__compact_pgdat(NODE_DATA(nid), &cc);
> diff --git a/mm/internal.h b/mm/internal.h
> index 556bc9d0a817..2acdee8ab0e6 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -178,7 +178,7 @@ struct compact_control {
>  	enum migrate_mode mode;		/* Async or sync migration mode */
>  	bool ignore_skip_hint;		/* Scan blocks even if marked skip */
>  	bool direct_compaction;		/* False from kcompactd or /proc/... */
> -	bool whole_zone;		/* Whole zone has been scanned */
> +	bool whole_zone;		/* Whole zone should/has been scanned */
>  	int order;			/* order a direct compactor needs */
>  	const gfp_t gfp_mask;		/* gfp mask of a direct compactor */
>  	const unsigned int alloc_flags;	/* alloc flags of a direct compactor */
> -- 
> 2.8.2
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority
  2016-05-10  7:36 ` [RFC 11/13] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
@ 2016-05-13 13:38   ` Michal Hocko
  2016-05-16  7:17     ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 13:38 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:
> During reclaim/compaction loop, it's desirable to get a final answer from
> unsuccessful compaction so we can either fail the allocation or invoke the OOM
> killer. However, heuristics such as deferred compaction or pageblock skip bits
> can cause compaction to skip parts or whole zones and lead to premature OOM's,
> failures or excessive reclaim/compaction retries.
> 
> To remedy this, we introduce a new direct compaction priority called
> COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
> 
> - ignore deferred compaction status for a zone
> - ignore pageblock skip hints
> - ignore cached scanner positions and scan the whole zone
> - use MIGRATE_SYNC migration mode

I do not think we can do MIGRATE_SYNC because fallback_migrate_page
would trigger pageout and we are in the allocation path and so we
could blow up the stack.

> The new priority should get eventually picked up by should_compact_retry() and
> this should improve success rates for costly allocations using __GFP_RETRY,

s@__GFP_RETRY@__GFP_REPEAT@

> such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
> allocations.

My testing has shown that even the current implementation, with
deferring, skip hints and cached positions, had a (close to) 100% success
rate even under close-to-OOM conditions.

I am wondering whether this strongest priority should be used only for
!costly high-order pages. But we probably want fewer special cases
between costly and !costly orders.

> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
>  include/linux/compaction.h |  1 +
>  mm/compaction.c            | 15 ++++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
[...]
> @@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  								ac->nodemask) {
>  		enum compact_result status;
>  
> -		if (compaction_deferred(zone, order)) {
> +		if (prio > COMPACT_PRIO_SYNC_FULL
> +					&& compaction_deferred(zone, order)) {
>  			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>  			continue;
>  		}

Wouldn't it be better to pull the prio check into compaction_deferred
directly? There are more callers and I am not really sure all of them
would behave consistently.
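
For illustration, one possible shape (hypothetical, not part of the posted
series):

	/*
	 * Hypothetical variant that takes the priority, so that every caller
	 * gets the same behavior instead of open-coding the check.
	 */
	static inline bool compaction_deferred_prio(struct zone *zone, int order,
						    enum compact_priority prio)
	{
		/* the strongest priority ignores the deferral heuristic */
		if (prio <= COMPACT_PRIO_SYNC_FULL)
			return false;

		return compaction_deferred(zone, order);
	}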
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
  2016-05-10 12:55   ` Vlastimil Babka
@ 2016-05-13 14:15   ` Michal Hocko
  2016-05-16  7:31     ` Vlastimil Babka
  2016-05-31  6:37   ` Joonsoo Kim
  2 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-13 14:15 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
> During reclaim/compaction loop, compaction priority can be increased by the
> should_compact_retry() function, but the current code is not optimal for
> several reasons:
> 
> - priority is only increased when compaction_failed() is true, which means
>   that compaction has scanned the whole zone. This may not happen even after
>   multiple attempts with the lower priority due to parallel activity, so we
>   might needlessly struggle on the lower priority.

OK, I can see that this can be changed if at least one full round is
guaranteed, which seems to be the case for the lowest priority.

> 
> - should_compact_retry() is only called when should_reclaim_retry() returns
>   false. This means that compaction priority cannot get increased as long
>   as reclaim makes sufficient progress. Theoretically, reclaim should stop
>   retrying for high-order allocations as long as the high-order page doesn't
>   exist but due to races, this may result in spurious retries when the
>   high-order page momentarily does exist.

This is intentional behavior and I would like to preserve it if it is
possible. For higher order pages should_reclaim_retry retries as long
as there are some eligible high order pages present which are just hidden
by the watermark check. So this is mostly to get us over watermarks to
start caring about fragmentation. If we race there then nothing really
terrible should happen and we should eventually converge to a terminal
state.

Does this make sense to you?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 08/13] mm, compaction: simplify contended compaction handling
  2016-05-13 13:09   ` Michal Hocko
@ 2016-05-16  7:10     ` Vlastimil Babka
  0 siblings, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-16  7:10 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/13/2016 03:09 PM, Michal Hocko wrote:
>> >@@ -1564,14 +1564,11 @@ static enum compact_result compact_zone(struct zone *zone, struct compact_contro
>> >  	trace_mm_compaction_end(start_pfn, cc->migrate_pfn,
>> >  				cc->free_pfn, end_pfn, sync, ret);
>> >
>> >-	if (ret == COMPACT_CONTENDED)
>> >-		ret = COMPACT_PARTIAL;
>> >-
>> >  	return ret;
>> >  }
> This took me a while to grasp but then I realized this is correct
> because we shouldn't pretend progress when there was none in fact,
> especially when __alloc_pages_direct_compact basically replaced this
> "fake" COMPACT_PARTIAL by COMPACT_CONTENDED anyway.

Yes. Actually, the COMPACT_CONTENDED compact_result used to be just for
the tracepoint, and __alloc_pages_direct_compact used another function
parameter to signal contention. You changed it with the oom rework so
that the COMPACT_CONTENDED result value was used, and this hunk just makes
sure it's still reported correctly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority
  2016-05-13 13:38   ` Michal Hocko
@ 2016-05-16  7:17     ` Vlastimil Babka
  2016-05-16  8:11       ` Michal Hocko
  2016-05-18 12:46       ` Vlastimil Babka
  0 siblings, 2 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-16  7:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/13/2016 03:38 PM, Michal Hocko wrote:
> On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:
>> During reclaim/compaction loop, it's desirable to get a final answer from
>> unsuccessful compaction so we can either fail the allocation or invoke the OOM
>> killer. However, heuristics such as deferred compaction or pageblock skip bits
>> can cause compaction to skip parts or whole zones and lead to premature OOM's,
>> failures or excessive reclaim/compaction retries.
>>
>> To remedy this, we introduce a new direct compaction priority called
>> COMPACT_PRIO_SYNC_FULL, which instructs direct compaction to:
>>
>> - ignore deferred compaction status for a zone
>> - ignore pageblock skip hints
>> - ignore cached scanner positions and scan the whole zone
>> - use MIGRATE_SYNC migration mode
>
> I do not think we can do MIGRATE_SYNC because fallback_migrate_page
> would trigger pageout and we are in the allocation path and so we
> could blow up the stack.

Ah, I thought it was just waiting for the writeout to complete, and you 
wanted to introduce another migrate mode to actually do the writeout. 
But looks like I misremembered.

>> The new priority should get eventually picked up by should_compact_retry() and
>> this should improve success rates for costly allocations using __GFP_RETRY,
>
> s@__GFP_RETRY@__GFP_REPEAT@

Ah thanks. Depending on the patch timing it might be __GFP_RETRY_HARD in 
the end, right :)

>> such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
>> allocations.
>
> My testing has shown that even with the current implementation with
> deferring, skip hints and cached positions had (close to) 100% success
> rate even with close to OOM conditions.

Hmm, I thought you at one point said that ignoring skip hints was a 
large improvement, because the current resetting of them is just too fuzzy.

> I am wondering whether this strongest priority should be done only for
> !costly high order pages. But we probably want less special cases
> between costly and !costly orders.

Yeah, if somebody wants to retry hard, let him.

>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>
> Acked-by: Michal Hocko <mhocko@suse.com>
>
>> ---
>>   include/linux/compaction.h |  1 +
>>   mm/compaction.c            | 15 ++++++++++++---
>>   2 files changed, 13 insertions(+), 3 deletions(-)
>>
> [...]
>> @@ -1631,7 +1639,8 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>>   								ac->nodemask) {
>>   		enum compact_result status;
>>
>> -		if (compaction_deferred(zone, order)) {
>> +		if (prio > COMPACT_PRIO_SYNC_FULL
>> +					&& compaction_deferred(zone, order)) {
>>   			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>>   			continue;
>>   		}
>
> Wouldn't it be better to pull the prio check into compaction_deferred
> directly? There are more callers and I am not really sure all of them
> would behave consistently.

I'll check, thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-13 14:15   ` Michal Hocko
@ 2016-05-16  7:31     ` Vlastimil Babka
  2016-05-16  8:14       ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-16  7:31 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/13/2016 04:15 PM, Michal Hocko wrote:
> On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
>>
>> - should_compact_retry() is only called when should_reclaim_retry() returns
>>    false. This means that compaction priority cannot get increased as long
>>    as reclaim makes sufficient progress. Theoretically, reclaim should stop
>>    retrying for high-order allocations as long as the high-order page doesn't
>>    exist but due to races, this may result in spurious retries when the
>>    high-order page momentarily does exist.
>
> This is intentional behavior and I would like to preserve it if it is
> possible. For higher order pages should_reclaim_retry retries as long
> as there are some eligible high order pages present which are just hidden
> by the watermark check. So this is mostly to get us over watermarks to
> start carrying about fragmentation. If we race there then nothing really
> terrible should happen and we should eventually converge to a terminal
> state.
>
> Does this make sense to you?

Yeah it should work, my only worry was that this may get subtly wrong 
(as experience shows us) and due to e.g. slightly different watermark 
checks and/or a corner-case zone such as ZONE_DMA, 
should_reclaim_retry() would keep returning true, even if reclaim 
couldn't/wouldn't help anything. Then compaction would be needlessly 
kept at ineffective priority.

Also my understanding of the initial compaction priorities is to lower 
the latency if fragmentation is just light and there's enough memory. 
Once we start struggling, I don't see much point in not switching to the 
full compaction priority quickly.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority
  2016-05-16  7:17     ` Vlastimil Babka
@ 2016-05-16  8:11       ` Michal Hocko
  2016-05-18 12:46       ` Vlastimil Babka
  1 sibling, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-16  8:11 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Mon 16-05-16 09:17:11, Vlastimil Babka wrote:
> On 05/13/2016 03:38 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:01, Vlastimil Babka wrote:
[...]
> > > such as hugetlbfs allocations, and reduce some corner-case OOM's for non-costly
> > > allocations.
> > 
> > My testing has shown that even with the current implementation with
> > deferring, skip hints and cached positions had (close to) 100% success
> > rate even with close to OOM conditions.
> 
> Hmm, I thought you at one point said that ignoring skip hints was a large
> improvement, because the current resetting of them is just too fuzzy.

Not in the hugetlb test. But you are right that skip hints resulted in
really fuzzy behavior.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-16  7:31     ` Vlastimil Babka
@ 2016-05-16  8:14       ` Michal Hocko
  2016-05-16  9:27         ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-16  8:14 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Mon 16-05-16 09:31:44, Vlastimil Babka wrote:
> On 05/13/2016 04:15 PM, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:02, Vlastimil Babka wrote:
> > > 
> > > - should_compact_retry() is only called when should_reclaim_retry() returns
> > >    false. This means that compaction priority cannot get increased as long
> > >    as reclaim makes sufficient progress. Theoretically, reclaim should stop
> > >    retrying for high-order allocations as long as the high-order page doesn't
> > >    exist but due to races, this may result in spurious retries when the
> > >    high-order page momentarily does exist.
> > 
> > This is intentional behavior and I would like to preserve it if it is
> > possible. For higher order pages should_reclaim_retry retries as long
> > as there are some eligible high order pages present which are just hidden
> > by the watermark check. So this is mostly to get us over watermarks to
> > start caring about fragmentation. If we race there then nothing really
> > terrible should happen and we should eventually converge to a terminal
> > state.
> > 
> > Does this make sense to you?
> 
> Yeah it should work, my only worry was that this may get subtly wrong (as
> experience shows us) and due to e.g. slightly different watermark checks
> and/or a corner-case zone such as ZONE_DMA, should_reclaim_retry() would
> keep returning true, even if reclaim couldn't/wouldn't help anything. Then
> compaction would be needlessly kept at an ineffective priority.

The watermark check for ZONE_DMA should always fail, because it fails
even when the zone is completely free, due to the lowmem reserves. I had
a subtle bug in the original code checking highzone_idx rather than
classzone_idx, but the fix for that has been posted recently:
http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org
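
(For reference, the relevant part of the watermark check is roughly the
following; with a high classzone_idx the lowmem reserve alone is typically
larger than the whole of ZONE_DMA:)

	/* paraphrased from __zone_watermark_ok(); details may differ */
	if (free_pages <= min + z->lowmem_reserve[classzone_idx])
		return false;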

> Also my understanding of the initial compaction priorities is to lower the
> latency if fragmentation is just light and there's enough memory. Once we
> start struggling, I don't see much point in not switching to the full
> compaction priority quickly.

That is true, but why compact when there are high-order pages and they
are just hidden by the watermark check?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-10  7:36 ` [RFC 13/13] mm, compaction: fix and improve watermark handling Vlastimil Babka
@ 2016-05-16  9:25   ` Michal Hocko
  2016-05-16  9:50     ` Vlastimil Babka
  2016-05-18 13:50     ` Mel Gorman
  0 siblings, 2 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-16  9:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
> Compaction has been using watermark checks when deciding whether it was
> successful, and whether compaction is at all suitable. There are few problems
> with these checks.
> 
> - __compact_finished() uses low watermark in a check that has to pass if
>   the direct compaction is to finish and allocation should succeed. This is
>   too pessimistic, as the allocation will typically use min watermark. It
>   may happen that during compaction, we drop below the low watermark (due to
>   parallel activity), but still form the target high-order page. By checking
>   against low watermark, we might needlessly continue compaction. After this
>   patch, the check uses direct compactor's alloc_flags to determine the
>   watermark, which is effectively the min watermark.

OK, this makes some sense. It would be great if we could have at least
some clarification of why the low wmark has been used previously. Perhaps
Mel can remember?

> - __compaction_suitable has the same issue in the check whether the allocation
>   is already supposed to succeed and we don't need to compact. Fix it the same
>   way.
>
> - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
>   to decide if there's enough free memory to perform compaction. This check

And this was a real head scratcher when I started looking into
compaction recently. Why do we need to be above the low watermark to even
start compaction? Compaction uses additional memory only for a short
period of time and then releases the already migrated pages.
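
(The check in question looks roughly like this, paraphrased from
__compaction_suitable(); details may differ:)

	/*
	 * paraphrased: require the low watermark plus a 2 << order gap for
	 * the pages compaction temporarily consumes while migrating
	 */
	watermark = low_wmark_pages(zone) + (2UL << order);
	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;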

>   uses direct compactor's alloc_flags, but that's wrong. If alloc_flags doesn't
>   include ALLOC_CMA, we might fail the check, even though the freepage
>   isolation isn't restricted outside of CMA pageblocks. On the other hand,
>   alloc_flags may indicate access to memory reserves, making compaction proceed
>   and then fail watermark check during freepage isolation, which doesn't pass
>   alloc_flags. The fix here is to use fixed ALLOC_CMA flags in the
>   __compaction_suitable() check.

This makes my head hurt. Whut?

> - __isolate_free_page uses low watermark check to decide if free page can be
>   isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.

Why do we check the watermark at all? What would happen if this obscure
if (!is_migrate_isolate(mt)) were gone? I remember I put some tracing
there and it never hit for me, even when I was testing close to OOM
conditions. Maybe an earlier check bailed out, but this code path looks
really obscure, so it either deserves a large fat comment or should die.

> - The use of low watermark checks in __compaction_suitable() and
>   __isolate_free_page does perhaps make sense for high-order allocations where
>   more freepages increase the chance of success, and we can typically fail
>   with some order-0 fallback when the system is struggling. But for low-order
>   allocation, forming the page should not be that hard. So using low watermark
>   here might just prevent compaction from even trying, and eventually lead to
>   OOM killer even if we are above min watermarks. So after this patch, we use
>   min watermark for non-costly orders in these checks, by passing the
>   alloc_flags parameter to split_page() and __isolate_free_page().

OK, so IIUC, costly high-order requests shouldn't even try when we are
below the watermark (unless they are __GFP_REPEAT, which would get us to a
stronger compaction mode/priority), and that would reclaim us over the low
wmark and go on. Is that what you are saying? This makes some sense, but
then let's have a _single_ place to check the watermark, please. These
checks at a few different levels are just subtle as hell and error prone
likewise.

> To sum up, after this patch, the kernel should in some situations finish
> successful direct compaction sooner, prevent compaction from starting when it's
> not needed, proceed with compaction when free memory is in CMA pageblocks, and
> for non-costly orders, prevent OOM killing or excessive reclaim when free
> memory is between the min and low watermarks.

Could you please split this patch into three(?) parts: one to remove as
many wmark checks as possible, one to move the low wmark to min for
!costly high orders, and finally the CMA part, which I fail to understand...

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-16  8:14       ` Michal Hocko
@ 2016-05-16  9:27         ` Vlastimil Babka
  2016-05-16  9:52           ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-16  9:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/16/2016 10:14 AM, Michal Hocko wrote:
> On Mon 16-05-16 09:31:44, Vlastimil Babka wrote:
>>
>> Yeah it should work, my only worry was that this may get subtly wrong (as
>> experience shows us) and due to e.g. slightly different watermark checks
>> and/or a corner-case zone such as ZONE_DMA, should_reclaim_retry() would
>> keep returning true, even if reclaim couldn't/wouldn't help anything. Then
>> compaction would be needlessly kept at an ineffective priority.
>
> watermark check for ZONE_DMA should always fail because it fails even
> when is completely free to the lowmem reserves. I had a subtle bug in
> the original code to check highzone_idx rather than classzone_idx but
> that should the fix has been posted recently:
> http://lkml.kernel.org/r/1463051677-29418-2-git-send-email-mhocko@kernel.org

Sure, but that just adds to the experience of things being subtly wrong
in this area :) But sure, we can leave this part alone until proven wrong;
I don't insist strongly.

>> Also my understanding of the initial compaction priorities is to lower the
>> latency if fragmentation is just light and there's enough memory. Once we
>> start struggling, I don't see much point in not switching to the full
>> compaction priority quickly.
>
> That is true but why to compact when there are high order pages and they
> are just hidden by the watermark check.

Compaction should skip such a zone regardless of priority.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-16  9:25   ` Michal Hocko
@ 2016-05-16  9:50     ` Vlastimil Babka
  2016-05-16 12:30       ` Michal Hocko
  2016-05-18 13:50     ` Mel Gorman
  1 sibling, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-16  9:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/16/2016 11:25 AM, Michal Hocko wrote:
> On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
>> Compaction has been using watermark checks when deciding whether it was
>> successful, and whether compaction is at all suitable. There are few problems
>> with these checks.
>>
>> - __compact_finished() uses low watermark in a check that has to pass if
>>    the direct compaction is to finish and allocation should succeed. This is
>>    too pessimistic, as the allocation will typically use min watermark. It
>>    may happen that during compaction, we drop below the low watermark (due to
>>    parallel activity), but still form the target high-order page. By checking
>>    against low watermark, we might needlessly continue compaction. After this
>>    patch, the check uses direct compactor's alloc_flags to determine the
>>    watermark, which is effectively the min watermark.
>
> OK, this makes some sense. It would be great if we could have at least
> some clarification why the low wmark has been used previously. Probably
> Mel can remember?
>
>> - __compaction_suitable has the same issue in the check whether the allocation
>>    is already supposed to succeed and we don't need to compact. Fix it the same
>>    way.
>>
>> - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
>>    to decide if there's enough free memory to perform compaction. This check
>
> And this was a real head scratcher when I started looking into the
> compaction recently. Why do we need to be above low watermark to even
> start compaction.

Hmm, above you said you're fine with low wmark (maybe after 
clarification). I don't know why it was used, can only guess.

> Compaction uses additional memory only for a short
> period of time and then releases the already migrated pages.

As for the 2 << order gap: I can imagine that e.g. order-5 compaction
(32 pages) isolates 20 pages for migration and starts looking for free
pages. It collects 19 free pages and then reaches an order-4 free page.
Splitting that page to collect it would result in 19+16=35 pages
isolated, thus exceeding the 1 << order gap, and failing. With a 2 << order
gap, the chances of this happening are reduced.

>>    uses direct compactor's alloc_flags, but that's wrong. If alloc_flags doesn't
>>    include ALLOC_CMA, we might fail the check, even though the freepage
>>    isolation isn't restricted outside of CMA pageblocks. On the other hand,
>>    alloc_flags may indicate access to memory reserves, making compaction proceed
>>    and then fail watermark check during freepage isolation, which doesn't pass
>>    alloc_flags. The fix here is to use fixed ALLOC_CMA flags in the
>>    __compaction_suitable() check.
>
> This makes my head hurt. Whut?

I'll try to explain better next time.

>> - __isolate_free_page uses low watermark check to decide if free page can be
>>    isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.
>
> Why do we check the watermark at all? What would happen if this obscure
> if (!is_migrate_isolate(mt)) was gone? I remember I put some tracing
> there and it never hit for me even when I was testing close to OOM
> conditions. Maybe an earlier check bailed out but this code path looks
> really obscure so it should either deserve a large fat comment or to
> die.

The check is there so that compaction doesn't exhaust memory below 
reserves during its work, just like any other non-privileged allocation.
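
(For reference, the check under discussion is roughly the following part of
__isolate_free_page():)

	if (!is_migrate_isolate(mt)) {
		/* obey watermarks as if the page was being allocated */
		watermark = low_wmark_pages(zone) + (1 << order);
		if (!zone_watermark_ok(zone, 0, watermark, 0, 0))
			return 0;

		__mod_zone_freepage_state(zone, -(1UL << order), mt);
	}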

>> - The use of low watermark checks in __compaction_suitable() and
>>    __isolate_free_page does perhaps make sense for high-order allocations where
>>    more freepages increase the chance of success, and we can typically fail
>>    with some order-0 fallback when the system is struggling. But for low-order
>>    allocation, forming the page should not be that hard. So using low watermark
>>    here might just prevent compaction from even trying, and eventually lead to
>>    OOM killer even if we are above min watermarks. So after this patch, we use
>>    min watermark for non-costly orders in these checks, by passing the
>>    alloc_flags parameter to split_page() and __isolate_free_page().
>
> OK, so if IIUC costly high order requests even shouldn't try when we are
> below watermark (unless they are __GFP_REPEAT which would get us to a
> stronger compaction mode/priority) and that would reclaim us over low
> wmark and go on. Is that what you are saying? This makes some sense but
> then let's have a _single_ place to check the watermark, please. These
> checks at a few different levels are just subtle as hell and error prone
> likewise.

What single place then? The situation might change dynamically so 
passing the initial __compaction_suitable() check doesn't guarantee that 
enough free pages are still available when it comes to isolating 
freepages. Your testing that never hit it shows that this is rare, but 
do we want to risk compaction making an OOM situation worse?

>> To sum up, after this patch, the kernel should in some situations finish
>> successful direct compaction sooner, prevent compaction from starting when it's
>> not needed, proceed with compaction when free memory is in CMA pageblocks, and
>> for non-costly orders, prevent OOM killing or excessive reclaim when free
>> memory is between the min and low watermarks.
>
> Could you please split this patch into three(?) parts. One to remove as many
> wmark checks as possible, move low wmark to min for !costly high orders
> and finally the cma part which I fail to understand...

Sure, although I'm not yet convinced we can remove any checks.

> Thanks!
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-16  9:27         ` Vlastimil Babka
@ 2016-05-16  9:52           ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-16  9:52 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Mon 16-05-16 11:27:56, Vlastimil Babka wrote:
> On 05/16/2016 10:14 AM, Michal Hocko wrote:
> > On Mon 16-05-16 09:31:44, Vlastimil Babka wrote:
[...]
> > > Also my understanding of the initial compaction priorities is to lower the
> > > latency if fragmentation is just light and there's enough memory. Once we
> > > start struggling, I don't see much point in not switching to the full
> > > compaction priority quickly.
> > 
> > That is true but why to compact when there are high order pages and they
> > are just hidden by the watermark check.
> 
> Compaction should skip such zone regardless of priority.

The point I've tried to raise is that we shouldn't conflate the purpose
of the two. Reclaim is here primarily to get us over the watermarks,
while compaction is here to form high-order pages. If we mix the two
together, the distinction is blurred, which, I believe, will lead to more
complicated code in the end. I might be wrong here, of course, but let's
try to keep compaction as free of wmark checks as possible.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-16  9:50     ` Vlastimil Babka
@ 2016-05-16 12:30       ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-16 12:30 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Mon 16-05-16 11:50:22, Vlastimil Babka wrote:
> On 05/16/2016 11:25 AM, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
> > > Compaction has been using watermark checks when deciding whether it was
> > > successful, and whether compaction is at all suitable. There are few problems
> > > with these checks.
> > > 
> > > - __compact_finished() uses low watermark in a check that has to pass if
> > >    the direct compaction is to finish and allocation should succeed. This is
> > >    too pessimistic, as the allocation will typically use min watermark. It
> > >    may happen that during compaction, we drop below the low watermark (due to
> > >    parallel activity), but still form the target high-order page. By checking
> > >    against low watermark, we might needlessly continue compaction. After this
> > >    patch, the check uses direct compactor's alloc_flags to determine the
> > >    watermark, which is effectively the min watermark.
> > 
> > OK, this makes some sense. It would be great if we could have at least
> > some clarification why the low wmark has been used previously. Probably
> > Mel can remember?
> > 
> > > - __compaction_suitable has the same issue in the check whether the allocation
> > >    is already supposed to succeed and we don't need to compact. Fix it the same
> > >    way.
> > > 
> > > - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
> > >    to decide if there's enough free memory to perform compaction. This check
> > 
> > And this was a real head scratcher when I started looking into the
> > compaction recently. Why do we need to be above low watermark to even
> > start compaction.
> 
> Hmm, above you said you're fine with low wmark (maybe after clarification).
> I don't know why it was used; I can only guess.

Yes I can imagine this would be a good backoff for costly orders without
__GFP_REPEAT.

> > Compaction uses additional memory only for a short
> > period of time and then releases the already migrated pages.
> 
> As for the 2 << order gap: I can imagine that e.g. order-5 compaction (32
> pages) isolates 20 pages for migration and starts looking for free pages. It
> collects 19 free pages and then reaches an order-4 free page. Splitting that
> page to collect it would result in 19+16=35 pages isolated, thus exceeding
> the 1 << order gap and failing. With a 2 << order gap, the chances of this
> happening are reduced.
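> (In numbers: with only a 1 << 5 = 32 page gap above the watermark, isolating
> 19 + 16 = 35 free pages would already drop the zone below the watermark and
> fail the check; with a 2 << 5 = 64 page gap, 35 isolated pages still fit.)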

OK, fair enough, but that sounds like a case which is not worth optimizing
for and introducing subtle code for.

[...]

> > > - __isolate_free_page uses low watermark check to decide if free page can be
> > >    isolated. It also doesn't use ALLOC_CMA, so add it for the same reasons.
> > 
> > Why do we check the watermark at all? What would happen if this obscure
> > if (!is_migrate_isolate(mt)) were gone? I remember I put some tracing
> > there and it never hit for me even when I was testing close to OOM
> > conditions. Maybe an earlier check bailed out, but this code path looks
> > really obscure so it either deserves a large fat comment or should
> > die.
> 
> The check is there so that compaction doesn't exhaust memory below reserves
> during its work, just like any other non-privileged allocation.

Hmm. OK, this is a fair point. I would expect that the reclaim preceding
the compaction would compensate for the temporarily used memory, but it
is true that a) we might be in the optimistic async compaction which
happens _before_ the reclaim and b) the reclaim might not be effective
enough, so some throttling is indeed appropriate.

I guess you do not want to rely on throttling only at the beginning of
the compaction because it would be too racy, and that is true. So it
would indeed be safer to check the watermark both when we attempt to
compact and when we isolate free pages. Can we at least use a common
helper so that we know those checks are done the same way?
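Something along these lines, perhaps (a rough sketch only; the helper name
and exact placement are made up, not an existing function):

	/*
	 * Hypothetical shared watermark check that both
	 * __compaction_suitable() and __isolate_free_page() could call,
	 * so the two paths cannot diverge in how they treat the
	 * watermark and ALLOC_CMA.
	 */
	static bool compaction_wmark_ok(struct zone *zone, int classzone_idx,
					unsigned int alloc_flags,
					unsigned long extra_pages)
	{
		unsigned long watermark = min_wmark_pages(zone) + extra_pages;

		return zone_watermark_ok(zone, 0, watermark, classzone_idx,
					 alloc_flags | ALLOC_CMA);
	}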
 
 Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 00/13] make direct compaction more deterministic
  2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
                   ` (12 preceding siblings ...)
  2016-05-10  7:36 ` [RFC 13/13] mm, compaction: fix and improve watermark handling Vlastimil Babka
@ 2016-05-17 20:01 ` Michal Hocko
  2016-05-18  7:19   ` Vlastimil Babka
  13 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-17 20:01 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

Btw. I think that the first three patches are nice cleanups and easy enough,
so I would vote for merging them earlier.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 00/13] make direct compaction more deterministic
  2016-05-17 20:01 ` [RFC 00/13] make direct compaction more deterministic Michal Hocko
@ 2016-05-18  7:19   ` Vlastimil Babka
  0 siblings, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-18  7:19 UTC (permalink / raw)
  To: Michal Hocko, Andrew Morton
  Cc: linux-mm, Joonsoo Kim, Rik van Riel, David Rientjes, Mel Gorman,
	Johannes Weiner, Tetsuo Handa, linux-kernel, Linus Torvalds

On 05/17/2016 10:01 PM, Michal Hocko wrote:
> Btw. I think that the first three patches are nice cleanups and easy enough,
> so I would vote for merging them earlier.

I wouldn't mind if patches 1-3 (note: the second version of patch 2 was 
posted as a reply!) went to mmotm now, but the merge window is already 
open, so it's unlikely to get into 4.7 anyway?

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-13 12:05       ` Michal Hocko
@ 2016-05-18 11:59         ` Vlastimil Babka
  2016-05-18 15:24           ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-18 11:59 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/13/2016 02:05 PM, Michal Hocko wrote:
> On Fri 13-05-16 10:23:31, Vlastimil Babka wrote:
>> On 05/12/2016 06:20 PM, Michal Hocko wrote:
>>> On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
>>> [...]
>>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
>>>> index 570383a41853..0cb09714d960 100644
>>>> --- a/include/linux/gfp.h
>>>> +++ b/include/linux/gfp.h
>>>> @@ -256,8 +256,7 @@ struct vm_area_struct;
>>>>    #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>>>>    #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
>>>>    #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>>>> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>>>> -			 ~__GFP_RECLAIM)
>>>> +			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
>>>
>>> I am not sure this is the right thing to do. I think we should keep
>>> __GFP_NORETRY and clear it where we want a stronger semantic. This is
>>> just too suble that all callsites are doing the right thing.
>>
>> That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you
>> think it's worth it, I can turn the default around, OK.
> 
> Hmm, on the other hand it is true that GFP_TRANSHUGE is clearing both
> reclaim flags by default and then overwrites that. This is just too
> ugly. Can we make GFP_TRANSHUGE to only define flags we care about and
> then tweak those that should go away at the callsites which matter now
> that we do not rely on is_thp_gfp_mask?
 
So the following patch attempts what you suggest, if I understand you
correctly. GFP_TRANSHUGE includes all possible flags, and then they are
removed as needed. I don't really think it helps code readability,
though. IMHO it's simpler to define GFP_TRANSHUGE as a minimal subset and
only add flags on top. You call the resulting #define ugly, but IMHO it's
better to have the ugliness in a single place, and not at multiple usage
places (see the diff below).

Note that this also affects the printk stuff.
With GFP_TRANSHUGE including all possible flags, it's unlikely printk
will ever print "GFP_TRANSHUGE", since most likely one or more of the
flags will always be missing.
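
For illustration, the flag-name decoding effectively does an all-bits
match, roughly like this (a simplified sketch, not the literal tracing
code):

	/*
	 * A composite gfp name is printed only when every bit of its mask
	 * is present, so a GFP_TRANSHUGE that includes all possible flags
	 * will rarely match a request that cleared some of them.
	 */
	static bool gfp_name_matches(gfp_t flags, gfp_t named_mask)
	{
		return (flags & named_mask) == named_mask;
	}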

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..e1998eb5c37f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -256,8 +256,7 @@ struct vm_area_struct;
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
 #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
-			 ~__GFP_RECLAIM)
+			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87f09dc986ab..370fbd3b24dd 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -216,7 +216,8 @@ struct page *get_huge_zero_page(void)
 	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
 		return READ_ONCE(huge_zero_page);
 
-	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO)
+			& ~(__GFP_MOVABLE | __GFP_NORETRY),
 			HPAGE_PMD_ORDER);
 	if (!zero_page) {
 		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
@@ -882,9 +883,10 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 }
 
 /*
- * If THP is set to always then directly reclaim/compact as necessary
- * If set to defer then do no reclaim and defer to khugepaged
+ * If THP defrag is set to always then directly reclaim/compact as necessary
+ * If set to defer then do only background reclaim/compact and defer to khugepaged
  * If set to madvise and the VMA is flagged then directly reclaim/compact
+ * When direct reclaim/compact is allowed, try a bit harder for flagged VMA's
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
@@ -896,15 +898,21 @@ static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
 		reclaim_flags = __GFP_KSWAPD_RECLAIM;
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
+		reclaim_flags = __GFP_DIRECT_RECLAIM |
+					((vma->vm_flags & VM_HUGEPAGE) ? 0 : __GFP_NORETRY);
 
-	return GFP_TRANSHUGE | reclaim_flags;
+	return (GFP_TRANSHUGE & ~(__GFP_RECLAIM | __GFP_NORETRY)) | reclaim_flags;
 }
 
 /* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
 static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 {
-	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+	/*
+	 * We don't want kswapd reclaim, and if khugepaged/defrag is disabled
+	 * we also disable direct reclaim. If we do direct reclaim, allow retrying.
+	 */
+	return GFP_TRANSHUGE & ~(khugepaged_defrag() ?
+			(__GFP_KSWAPD_RECLAIM | __GFP_NORETRY) : __GFP_RECLAIM);
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0cee863397e4..4a34187827ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3619,11 +3619,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
-			 * using async compaction, unless it's khugepaged
-			 * trying to collapse.
+			 * using async compaction.
 			 */
-			if (!(current->flags & PF_KTHREAD))
-				migration_mode = MIGRATE_ASYNC;
+			migration_mode = MIGRATE_ASYNC;
 		}
 	}
 
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 11/13] mm, compaction: add the ultimate direct compaction priority
  2016-05-16  7:17     ` Vlastimil Babka
  2016-05-16  8:11       ` Michal Hocko
@ 2016-05-18 12:46       ` Vlastimil Babka
  1 sibling, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-18 12:46 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/16/2016 09:17 AM, Vlastimil Babka wrote:
>> >Wouldn't it be better to pull the prio check into compaction_deferred
>> >directly? There are more callers and I am not really sure all of them
>> >would behave consistently.
> I'll check, thanks.

Hm, so the other callers of compaction_deferred() are in contexts 
where no direct compaction priority is set. They would have to pass 
something like DEF_COMPACT_PRIORITY. That starts getting subtle, so I'd 
rather not go that way.
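For illustration only (the three-argument form and the helper below are
hypothetical, which is exactly the subtlety): a caller without a priority,
e.g. kcompactd, would end up having to invent one, roughly like

	/* hypothetical sketch, not proposed code */
	static bool kcompactd_should_compact(struct zone *zone, int order)
	{
		/* which default priority is "correct" here is unclear */
		return !compaction_deferred(zone, order, DEF_COMPACT_PRIORITY);
	}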

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-16  9:25   ` Michal Hocko
  2016-05-16  9:50     ` Vlastimil Babka
@ 2016-05-18 13:50     ` Mel Gorman
  2016-05-18 14:27       ` Michal Hocko
  1 sibling, 1 reply; 61+ messages in thread
From: Mel Gorman @ 2016-05-18 13:50 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, Andrew Morton, Joonsoo Kim,
	Rik van Riel, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Mon, May 16, 2016 at 11:25:05AM +0200, Michal Hocko wrote:
> On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
> > Compaction has been using watermark checks when deciding whether it was
> > successful, and whether compaction is at all suitable. There are a few problems
> > with these checks.
> > 
> > - __compact_finished() uses low watermark in a check that has to pass if
> >   the direct compaction is to finish and allocation should succeed. This is
> >   too pessimistic, as the allocation will typically use min watermark. It
> >   may happen that during compaction, we drop below the low watermark (due to
> >   parallel activity), but still form the target high-order page. By checking
> >   against low watermark, we might needlessly continue compaction. After this
> >   patch, the check uses direct compactor's alloc_flags to determine the
> >   watermark, which is effectively the min watermark.
> 
> OK, this makes some sense. It would be great if we could have at least
> some clarification why the low wmark has been used previously. Probably
> Mel can remember?
> 

Two reasons -- it was a very rough estimate of whether enough pages are free
for compaction to have any chance. Secondly, it was to minimise the risk
that compaction would isolate so many pages that the zone was completely
depleted. This was a concern during the initial prototype of compaction.

> > - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
> >   to decide if there's enough free memory to perform compaction. This check
> 
> And this was a real head scratcher when I started looking into the
> compaction recently. Why do we need to be above low watermark to even
> start compaction. Compaction uses additional memory only for a short
> period of time and then releases the already migrated pages.
> 

Simply minimising the risk that compaction would deplete the entire
zone. Sure, it hands pages back shortly afterwards. At the time of the
initial prototype, page migration was severely broken and the system was
constantly crashing. The cautious checks were left in place after page
migration was fixed as there wasn't a compelling reason to remove them
at the time.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-18 13:50     ` Mel Gorman
@ 2016-05-18 14:27       ` Michal Hocko
  2016-05-18 14:40         ` Mel Gorman
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-18 14:27 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, linux-mm, Andrew Morton, Joonsoo Kim,
	Rik van Riel, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Wed 18-05-16 14:50:04, Mel Gorman wrote:
> On Mon, May 16, 2016 at 11:25:05AM +0200, Michal Hocko wrote:
> > On Tue 10-05-16 09:36:03, Vlastimil Babka wrote:
> > > Compaction has been using watermark checks when deciding whether it was
> > > successful, and whether compaction is at all suitable. There are a few problems
> > > with these checks.
> > > 
> > > - __compact_finished() uses low watermark in a check that has to pass if
> > >   the direct compaction is to finish and allocation should succeed. This is
> > >   too pessimistic, as the allocation will typically use min watermark. It
> > >   may happen that during compaction, we drop below the low watermark (due to
> > >   parallel activity), but still form the target high-order page. By checking
> > >   against low watermark, we might needlessly continue compaction. After this
> > >   patch, the check uses direct compactor's alloc_flags to determine the
> > >   watermark, which is effectively the min watermark.
> > 
> > OK, this makes some sense. It would be great if we could have at least
> > some clarification why the low wmark has been used previously. Probably
> > Mel can remember?
> > 
> 
> Two reasons -- it was a very rough estimate of whether enough pages are free
> for compaction to have any chance. Secondly, it was to minimise the risk
> that compaction would isolate so many pages that the zone was completely
> depleted. This was a concern during the initial prototype of compaction.
> 
> > > - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
> > >   to decide if there's enough free memory to perform compaction. This check
> > 
> > And this was a real head scratcher when I started looking into the
> > compaction recently. Why do we need to be above low watermark to even
> > start compaction. Compaction uses additional memory only for a short
> > period of time and then releases the already migrated pages.
> > 
> 
> Simply minimising the risk that compaction would deplete the entire
> zone. Sure, it hands pages back shortly afterwards. At the time of the
> initial prototype, page migration was severely broken and the system was
> constantly crashing. The cautious checks were left in place after page
> migration was fixed as there wasn't a compelling reason to remove them
> at the time.

OK, then moving from low_wmark to min_wmark + bias should work, right?
This would at least remove the discrepancy between the reclaim and
compaction thresholds to some degree, which is good IMHO.
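I.e. roughly (an illustrative sketch of the intended check in
__compaction_suitable(), not a literal patch):

	/*
	 * Enough free memory above the min watermark, plus the bias for
	 * pages compaction will temporarily isolate, so compaction and
	 * reclaim use the same baseline?
	 */
	unsigned long watermark = min_wmark_pages(zone) + (2UL << order);

	if (!zone_watermark_ok(zone, 0, watermark, classzone_idx, alloc_flags))
		return COMPACT_SKIPPED;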

Thanks!
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 13/13] mm, compaction: fix and improve watermark handling
  2016-05-18 14:27       ` Michal Hocko
@ 2016-05-18 14:40         ` Mel Gorman
  0 siblings, 0 replies; 61+ messages in thread
From: Mel Gorman @ 2016-05-18 14:40 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, linux-mm, Andrew Morton, Joonsoo Kim,
	Rik van Riel, David Rientjes, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Wed, May 18, 2016 at 04:27:53PM +0200, Michal Hocko wrote:
> > > > - __compaction_suitable() then checks the low watermark plus a (2 << order) gap
> > > >   to decide if there's enough free memory to perform compaction. This check
> > > 
> > > And this was a real head scratcher when I started looking into the
> > > compaction recently. Why do we need to be above low watermark to even
> > > start compaction. Compaction uses additional memory only for a short
> > > period of time and then releases the already migrated pages.
> > > 
> > 
> > Simply minimising the risk that compaction would deplete the entire
> > zone. Sure, it hands pages back shortly afterwards. At the time of the
> > initial prototype, page migration was severely broken and the system was
> > constantly crashing. The cautious checks were left in place after page
> > migration was fixed as there wasn't a compelling reason to remove them
> > at the time.
> 
> OK, then moving to min_wmark + bias from low_wmark should work, right?

Yes. I do recall there was another reason, but it's marginal. I didn't
want compaction isolating free pages to artificially push a process into
direct reclaim, but given that we are likely under memory pressure at
that time anyway, it's unlikely that compaction is the sole reason
processes are entering direct reclaim.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-18 11:59         ` Vlastimil Babka
@ 2016-05-18 15:24           ` Michal Hocko
  2016-05-20 13:57             ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Michal Hocko @ 2016-05-18 15:24 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Wed 18-05-16 13:59:53, Vlastimil Babka wrote:
> On 05/13/2016 02:05 PM, Michal Hocko wrote:
> > On Fri 13-05-16 10:23:31, Vlastimil Babka wrote:
> >> On 05/12/2016 06:20 PM, Michal Hocko wrote:
> >>> On Tue 10-05-16 09:35:56, Vlastimil Babka wrote:
> >>> [...]
> >>>> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> >>>> index 570383a41853..0cb09714d960 100644
> >>>> --- a/include/linux/gfp.h
> >>>> +++ b/include/linux/gfp.h
> >>>> @@ -256,8 +256,7 @@ struct vm_area_struct;
> >>>>    #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
> >>>>    #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> >>>>    #define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> >>>> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> >>>> -			 ~__GFP_RECLAIM)
> >>>> +			 __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
> >>>
> >>> I am not sure this is the right thing to do. I think we should keep
> >>> __GFP_NORETRY and clear it where we want a stronger semantic. This is
> >>> just too suble that all callsites are doing the right thing.
> >>
> >> That would complicate alloc_hugepage_direct_gfpmask() a bit, but if you
> >> think it's worth it, I can turn the default around, OK.
> > 
> > Hmm, on the other hand it is true that GFP_TRANSHUGE is clearing both
> > reclaim flags by default and then overwrites that. This is just too
> > ugly. Can we make GFP_TRANSHUGE to only define flags we care about and
> > then tweak those that should go away at the callsites which matter now
> > that we do not rely on is_thp_gfp_mask?
>  
> So the following patch attempts what you suggest, if I understand you
> correctly. GFP_TRANSHUGE includes all possible flag, and then they are
> removed as needed. I don't really think it helps code readability
> though.

yeah, it is ugly as _hell_. I do not think this deserves too much time
to discuss as the flag is mostly internal, but one last proposal would be
to define the different THP allocation contexts explicitly. Some callers
would still need some additional meddling, but maybe it would be slightly
better to read. Dunno. Anyway, if you think this is not really an
improvement then I won't insist on any change to your original patch.
---
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..e7926b466107 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -255,9 +255,14 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+
+/* Optimistic or latency sensitive THP allocation - page fault path */
+#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
 			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
 			 ~__GFP_RECLAIM)
+/* More serious THP allocation request - khugepaged */
+#define GFP_TRANSHUGE ((GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM) & \
+			~__GFP_NORETRY)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1a4d4c807d92..937b89c6c0aa 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -216,7 +216,7 @@ struct page *get_huge_zero_page(void)
 	if (likely(atomic_inc_not_zero(&huge_zero_refcount)))
 		return READ_ONCE(huge_zero_page);
 
-	zero_page = alloc_pages((GFP_TRANSHUGE | __GFP_ZERO) & ~__GFP_MOVABLE,
+	zero_page = alloc_pages((GFP_TRANSHUGE_LIGHT | __GFP_ZERO) & ~__GFP_MOVABLE,
 			HPAGE_PMD_ORDER);
 	if (!zero_page) {
 		count_vm_event(THP_ZERO_PAGE_ALLOC_FAILED);
@@ -888,23 +888,31 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	gfp_t reclaim_flags = 0;
+	gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT;
 
 	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
 	    (vma->vm_flags & VM_HUGEPAGE))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
+		gfp_mask = GFP_TRANSHUGE;
 	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
+		gfp_mask |= __GFP_KSWAPD_RECLAIM;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags)) {
+		if (vma->vm_flags & VM_HUGEPAGE)
+			gfp_mask = GFP_TRANSHUGE;
+		else
+			gfp_mask = GFP_TRANSHUGE | __GFP_NORETRY;
+	}
 
-	return GFP_TRANSHUGE | reclaim_flags;
+	return gfp_mask;
 }
 
 /* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
 static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 {
-	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+	gfp_t gfp_mask = GFP_TRANSHUGE_LIGHT;
+	if (khugepaged_defrag())
+		gfp_mask = GFP_TRANSHUGE;
+
+	return gfp_mask;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/migrate.c b/mm/migrate.c
index 53ab6398e7a2..1cd5c8c18343 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1771,7 +1771,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE) & ~__GFP_RECLAIM,
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-18 15:24           ` Michal Hocko
@ 2016-05-20 13:57             ` Vlastimil Babka
  2016-05-23  8:39               ` Michal Hocko
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-20 13:57 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/18/2016 05:24 PM, Michal Hocko wrote:
>>   
>> So the following patch attempts what you suggest, if I understand you
>> correctly. GFP_TRANSHUGE includes all possible flag, and then they are
>> removed as needed. I don't really think it helps code readability
>> though.
> 
> yeah, it is ugly as _hell_. I do not think this deserves too much time
> to discuss as the flag is mostly internal, but one last proposal would be
> to define the different THP allocation contexts explicitly. Some callers
> would still need some additional meddling, but maybe it would be slightly
> better to read. Dunno. Anyway, if you think this is not really an
> improvement then I won't insist on any change to your original patch.
> ---
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 570383a41853..e7926b466107 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -255,9 +255,14 @@ struct vm_area_struct;
>   #define GFP_DMA32	__GFP_DMA32
>   #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>   #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +
> +/* Optimistic or latency sensitive THP allocation - page fault path */
> +#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
>   			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
>   			 ~__GFP_RECLAIM)
> +/* More serious THP allocation request - kcompactd */
> +#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM) & \
> +			~__GFP_NORETRY

[...]

OK, I took the core idea and arrived at the following. I think it could
work, and the amount of further per-site modifications to GFP_TRANSHUGE*
is reduced, but if anyone thinks it's overkill to have two
GFP_TRANSHUGE* variants, I will just return to the original patch.

>From 48ddb10e96fd9741a9eb3be9672c13589db7239a Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Wed, 4 May 2016 13:40:03 +0200
Subject: [PATCH] mm, thp: remove __GFP_NORETRY from khugepaged and madvised
 allocations

After the previous patch, we can distinguish costly allocations that should be
really lightweight, such as THP page faults, with __GFP_NORETRY. This means we
don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can
also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to
try as hard as khugepaged, as the process has indicated that it benefits from
THP's and is willing to pay some initial latency costs.

We can also make the flags handling less cryptic by distinguishing
GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding __GFP_NORETRY
or __GFP_KSWAPD_RECLAIM is done where needed.

The patch effectively changes the current GFP_TRANSHUGE users as follows:

* get_huge_zero_page() - the zero page lifetime should be relatively long and
  it's shared by multiple users, so it's worth spending some effort on it.
  We use GFP_TRANSHUGE, and __GFP_NORETRY is not added. This also restores
  direct reclaim to this allocation, which was unintentionally removed by
  commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add
  a stall-free defrag option")

* alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not
  an issue. So if khugepaged "defrag" is enabled (the default), do reclaim
  via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the PF_KTHREAD check
  from page alloc.
  As a side-effect, khugepaged will now no longer check if the initial
  compaction was deferred or contended. This is OK, as khugepaged sleep times
  between collapse attempts are long enough to prevent noticeable disruption,
  so we should allow it to spend some effort.

* migrate_misplaced_transhuge_page() - already was masking out __GFP_RECLAIM,
  so just convert to GFP_TRANSHUGE_LIGHT which is equivalent.

* alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are
  now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY
  if direct reclaim/compaction is at all allowed (by default it's allowed only
  for madvised vma's). The rest is conversion to GFP_TRANSHUGE(_LIGHT).

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/gfp.h            | 14 ++++++++------
 include/trace/events/mmflags.h |  1 +
 mm/huge_memory.c               | 27 +++++++++++++++------------
 mm/migrate.c                   |  2 +-
 mm/page_alloc.c                |  6 ++----
 tools/perf/builtin-kmem.c      |  1 +
 6 files changed, 28 insertions(+), 23 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 570383a41853..1dfca27df492 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -238,9 +238,11 @@ struct vm_area_struct;
  *   are expected to be movable via page reclaim or page migration. Typically,
  *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
  *
- * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
- *   that will fail quickly if memory is not available and will not wake
- *   kswapd on failure.
+ * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
+ *   compound allocations that will generally fail quickly if memory is not
+ *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
+ *   version does not attempt reclaim/compaction at all and is by default used
+ *   in page fault path, while the non-light is used by khugepaged.
  */
 #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
 #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
@@ -255,9 +257,9 @@ struct vm_area_struct;
 #define GFP_DMA32	__GFP_DMA32
 #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
 #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
-#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
-			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
-			 ~__GFP_RECLAIM)
+#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
+			 __GFP_NOMEMALLOC| __GFP_NOWARN) & ~__GFP_RECLAIM)
+#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
 
 /* Convert GFP flags to their corresponding migrate type */
 #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
index 43cedbf0c759..5a81ab48a2fb 100644
--- a/include/trace/events/mmflags.h
+++ b/include/trace/events/mmflags.h
@@ -11,6 +11,7 @@
 
 #define __def_gfpflag_names						\
 	{(unsigned long)GFP_TRANSHUGE,		"GFP_TRANSHUGE"},	\
+	{(unsigned long)GFP_TRANSHUGE_LIGHT,	"GFP_TRANSHUGE_LIGHT"}, \
 	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"},\
 	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
 	{(unsigned long)GFP_USER,		"GFP_USER"},		\
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 87f09dc986ab..aa87db8c7f8f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -882,29 +882,32 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
 }
 
 /*
- * If THP is set to always then directly reclaim/compact as necessary
- * If set to defer then do no reclaim and defer to khugepaged
+ * If THP defrag is set to always then directly reclaim/compact as necessary
+ * If set to defer then do only background reclaim/compact and defer to khugepaged
  * If set to madvise and the VMA is flagged then directly reclaim/compact
+ * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
  */
 static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
 {
-	gfp_t reclaim_flags = 0;
+	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
 
-	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
-	    (vma->vm_flags & VM_HUGEPAGE))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_KSWAPD_RECLAIM;
-	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
-		reclaim_flags = __GFP_DIRECT_RECLAIM;
+	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
+				&transparent_hugepage_flags) && vma_madvised)
+		return GFP_TRANSHUGE;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
+						&transparent_hugepage_flags))
+		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
+	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
+						&transparent_hugepage_flags))
+		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
 
-	return GFP_TRANSHUGE | reclaim_flags;
+	return GFP_TRANSHUGE_LIGHT;
 }
 
 /* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
 static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
 {
-	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
+	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
 }
 
 /* Caller must hold page table lock. */
diff --git a/mm/migrate.c b/mm/migrate.c
index 53ab6398e7a2..bc82c56fa3af 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1771,7 +1771,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 		goto out_dropref;
 
 	new_page = alloc_pages_node(node,
-		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
+		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
 	if (!new_page)
 		goto out_fail;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0cee863397e4..4a34187827ca 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3619,11 +3619,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			/*
 			 * Looks like reclaim/compaction is worth trying, but
 			 * sync compaction could be very expensive, so keep
-			 * using async compaction, unless it's khugepaged
-			 * trying to collapse.
+			 * using async compaction.
 			 */
-			if (!(current->flags & PF_KTHREAD))
-				migration_mode = MIGRATE_ASYNC;
+			migration_mode = MIGRATE_ASYNC;
 		}
 	}
 
diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
index 5da5a9511cef..7fde754b344d 100644
--- a/tools/perf/builtin-kmem.c
+++ b/tools/perf/builtin-kmem.c
@@ -608,6 +608,7 @@ static const struct {
 	const char *compact;
 } gfp_compact_table[] = {
 	{ "GFP_TRANSHUGE",		"THP" },
+	{ "GFP_TRANSHUGE_LIGHT",	"THL" },
 	{ "GFP_HIGHUSER_MOVABLE",	"HUM" },
 	{ "GFP_HIGHUSER",		"HU" },
 	{ "GFP_USER",			"U" },
-- 
2.8.2

^ permalink raw reply related	[flat|nested] 61+ messages in thread

* Re: [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations
  2016-05-20 13:57             ` Vlastimil Babka
@ 2016-05-23  8:39               ` Michal Hocko
  0 siblings, 0 replies; 61+ messages in thread
From: Michal Hocko @ 2016-05-23  8:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Joonsoo Kim, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Fri 20-05-16 15:57:08, Vlastimil Babka wrote:
[...]
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Wed, 4 May 2016 13:40:03 +0200
> Subject: [PATCH] mm, thp: remove __GFP_NORETRY from khugepaged and madvised
>  allocations
> 
> After the previous patch, we can distinguish costly allocations that should be
> really lightweight, such as THP page faults, with __GFP_NORETRY. This means we
> don't need to recognize khugepaged allocations via PF_KTHREAD anymore. We can
> also change THP page faults in areas where madvise(MADV_HUGEPAGE) was used to
> try as hard as khugepaged, as the process has indicated that it benefits from
> THP's and is willing to pay some initial latency costs.
> 
> We can also make the flags handling less cryptic by distinguishing
> GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
> GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding __GFP_NORETRY
> or __GFP_KSWAPD_RECLAIM is done where needed.
> 
> The patch effectively changes the current GFP_TRANSHUGE users as follows:
> 
> * get_huge_zero_page() - the zero page lifetime should be relatively long and
>   it's shared by multiple users, so it's worth spending some effort on it.
>   We use GFP_TRANSHUGE, and __GFP_NORETRY is not added. This also restores
>   direct reclaim to this allocation, which was unintentionally removed by
>   commit e4a49efe4e7e ("mm: thp: set THP defrag by default to madvise and add
>   a stall-free defrag option")
> 
> * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency is not
>   an issue. So if khugepaged "defrag" is enabled (the default), do reclaim
>   via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the PF_KTHREAD check
>   from page alloc.
>   As a side-effect, khugepaged will now no longer check if the initial
>   compaction was deferred or contended. This is OK, as khugepaged sleep times
>   between collapse attempts are long enough to prevent noticeable disruption,
>   so we should allow it to spend some effort.
> 
> * migrate_misplaced_transhuge_page() - already was masking out __GFP_RECLAIM,
>   so just convert to GFP_TRANSHUGE_LIGHT which is equivalent.
> 
> * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise) are
>   now allocating without __GFP_NORETRY. Other vma's keep using __GFP_NORETRY
>   if direct reclaim/compaction is at all allowed (by default it's allowed only
>   for madvised vma's). The rest is conversion to GFP_TRANSHUGE(_LIGHT).
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

I like it more than the previous approach.

Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  include/linux/gfp.h            | 14 ++++++++------
>  include/trace/events/mmflags.h |  1 +
>  mm/huge_memory.c               | 27 +++++++++++++++------------
>  mm/migrate.c                   |  2 +-
>  mm/page_alloc.c                |  6 ++----
>  tools/perf/builtin-kmem.c      |  1 +
>  6 files changed, 28 insertions(+), 23 deletions(-)
> 
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 570383a41853..1dfca27df492 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -238,9 +238,11 @@ struct vm_area_struct;
>   *   are expected to be movable via page reclaim or page migration. Typically,
>   *   pages on the LRU would also be allocated with GFP_HIGHUSER_MOVABLE.
>   *
> - * GFP_TRANSHUGE is used for THP allocations. They are compound allocations
> - *   that will fail quickly if memory is not available and will not wake
> - *   kswapd on failure.
> + * GFP_TRANSHUGE and GFP_TRANSHUGE_LIGHT are used for THP allocations. They are
> + *   compound allocations that will generally fail quickly if memory is not
> + *   available and will not wake kswapd/kcompactd on failure. The _LIGHT
> + *   version does not attempt reclaim/compaction at all and is by default used
> + *   in page fault path, while the non-light is used by khugepaged.
>   */
>  #define GFP_ATOMIC	(__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
>  #define GFP_KERNEL	(__GFP_RECLAIM | __GFP_IO | __GFP_FS)
> @@ -255,9 +257,9 @@ struct vm_area_struct;
>  #define GFP_DMA32	__GFP_DMA32
>  #define GFP_HIGHUSER	(GFP_USER | __GFP_HIGHMEM)
>  #define GFP_HIGHUSER_MOVABLE	(GFP_HIGHUSER | __GFP_MOVABLE)
> -#define GFP_TRANSHUGE	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> -			 __GFP_NOMEMALLOC | __GFP_NORETRY | __GFP_NOWARN) & \
> -			 ~__GFP_RECLAIM)
> +#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
> +			 __GFP_NOMEMALLOC| __GFP_NOWARN) & ~__GFP_RECLAIM)
> +#define GFP_TRANSHUGE	(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
>  
>  /* Convert GFP flags to their corresponding migrate type */
>  #define GFP_MOVABLE_MASK (__GFP_RECLAIMABLE|__GFP_MOVABLE)
> diff --git a/include/trace/events/mmflags.h b/include/trace/events/mmflags.h
> index 43cedbf0c759..5a81ab48a2fb 100644
> --- a/include/trace/events/mmflags.h
> +++ b/include/trace/events/mmflags.h
> @@ -11,6 +11,7 @@
>  
>  #define __def_gfpflag_names						\
>  	{(unsigned long)GFP_TRANSHUGE,		"GFP_TRANSHUGE"},	\
> +	{(unsigned long)GFP_TRANSHUGE_LIGHT,	"GFP_TRANSHUGE_LIGHT"}, \
>  	{(unsigned long)GFP_HIGHUSER_MOVABLE,	"GFP_HIGHUSER_MOVABLE"},\
>  	{(unsigned long)GFP_HIGHUSER,		"GFP_HIGHUSER"},	\
>  	{(unsigned long)GFP_USER,		"GFP_USER"},		\
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 87f09dc986ab..aa87db8c7f8f 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -882,29 +882,32 @@ static int __do_huge_pmd_anonymous_page(struct mm_struct *mm,
>  }
>  
>  /*
> - * If THP is set to always then directly reclaim/compact as necessary
> - * If set to defer then do no reclaim and defer to khugepaged
> + * If THP defrag is set to always then directly reclaim/compact as necessary
> + * If set to defer then do only background reclaim/compact and defer to khugepaged
>   * If set to madvise and the VMA is flagged then directly reclaim/compact
> + * When direct reclaim/compact is allowed, don't retry except for flagged VMA's
>   */
>  static inline gfp_t alloc_hugepage_direct_gfpmask(struct vm_area_struct *vma)
>  {
> -	gfp_t reclaim_flags = 0;
> +	bool vma_madvised = !!(vma->vm_flags & VM_HUGEPAGE);
>  
> -	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG, &transparent_hugepage_flags) &&
> -	    (vma->vm_flags & VM_HUGEPAGE))
> -		reclaim_flags = __GFP_DIRECT_RECLAIM;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG, &transparent_hugepage_flags))
> -		reclaim_flags = __GFP_KSWAPD_RECLAIM;
> -	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG, &transparent_hugepage_flags))
> -		reclaim_flags = __GFP_DIRECT_RECLAIM;
> +	if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_REQ_MADV_FLAG,
> +				&transparent_hugepage_flags) && vma_madvised)
> +		return GFP_TRANSHUGE;
> +	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_KSWAPD_FLAG,
> +						&transparent_hugepage_flags))
> +		return GFP_TRANSHUGE_LIGHT | __GFP_KSWAPD_RECLAIM;
> +	else if (test_bit(TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
> +						&transparent_hugepage_flags))
> +		return GFP_TRANSHUGE | (vma_madvised ? 0 : __GFP_NORETRY);
>  
> -	return GFP_TRANSHUGE | reclaim_flags;
> +	return GFP_TRANSHUGE_LIGHT;
>  }
>  
>  /* Defrag for khugepaged will enter direct reclaim/compaction if necessary */
>  static inline gfp_t alloc_hugepage_khugepaged_gfpmask(void)
>  {
> -	return GFP_TRANSHUGE | (khugepaged_defrag() ? __GFP_DIRECT_RECLAIM : 0);
> +	return khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;
>  }
>  
>  /* Caller must hold page table lock. */
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 53ab6398e7a2..bc82c56fa3af 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1771,7 +1771,7 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
>  		goto out_dropref;
>  
>  	new_page = alloc_pages_node(node,
> -		(GFP_TRANSHUGE | __GFP_THISNODE) & ~__GFP_RECLAIM,
> +		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
>  		HPAGE_PMD_ORDER);
>  	if (!new_page)
>  		goto out_fail;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0cee863397e4..4a34187827ca 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3619,11 +3619,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  			/*
>  			 * Looks like reclaim/compaction is worth trying, but
>  			 * sync compaction could be very expensive, so keep
> -			 * using async compaction, unless it's khugepaged
> -			 * trying to collapse.
> +			 * using async compaction.
>  			 */
> -			if (!(current->flags & PF_KTHREAD))
> -				migration_mode = MIGRATE_ASYNC;
> +			migration_mode = MIGRATE_ASYNC;
>  		}
>  	}
>  
> diff --git a/tools/perf/builtin-kmem.c b/tools/perf/builtin-kmem.c
> index 5da5a9511cef..7fde754b344d 100644
> --- a/tools/perf/builtin-kmem.c
> +++ b/tools/perf/builtin-kmem.c
> @@ -608,6 +608,7 @@ static const struct {
>  	const char *compact;
>  } gfp_compact_table[] = {
>  	{ "GFP_TRANSHUGE",		"THP" },
> +	{ "GFP_TRANSHUGE_LIGHT",	"THL" },
>  	{ "GFP_HIGHUSER_MOVABLE",	"HUM" },
>  	{ "GFP_HIGHUSER",		"HU" },
>  	{ "GFP_USER",			"U" },
> -- 
> 2.8.2
> 
> 

-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-10 12:30     ` Vlastimil Babka
  2016-05-12 12:41       ` Michal Hocko
@ 2016-05-31  6:20       ` Joonsoo Kim
  2016-05-31  7:59         ` Vlastimil Babka
  1 sibling, 1 reply; 61+ messages in thread
From: Joonsoo Kim @ 2016-05-31  6:20 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, mhocko, linux-mm, akpm, riel, rientjes, mgorman,
	hannes, linux-kernel, torvalds

On Tue, May 10, 2016 at 02:30:11PM +0200, Vlastimil Babka wrote:
> On 05/10/2016 01:28 PM, Tetsuo Handa wrote:
> > Vlastimil Babka wrote:
> >> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> >> so move the initialization above the retry: label. Also make the comment above
> >> the initialization more descriptive.
> > 
> > Not true. gfp_to_alloc_flags() will include ALLOC_NO_WATERMARKS if current
> > thread got TIF_MEMDIE after gfp_to_alloc_flags() was called for the first
> 
> Oh, right. Stupid global state.
> 
> > time. Do you want to make TIF_MEMDIE threads fail their allocations without
> > using memory reserves?
> 
> No, thanks for catching this. How about the following version? I think
> that's an even nicer cleanup, if correct. Note it causes a conflict in
> patch 03/13 but it's simple to resolve.
> 
> Thanks
> 
> ----8<----
> >From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Fri, 29 Apr 2016 11:51:17 +0200
> Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
> 
> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> so move the initialization above the retry: label. Also make the comment above
> the initialization more descriptive.
> 
> The only exception to alloc_flags being constant is ALLOC_NO_WATERMARKS,
> which may change due to TIF_MEMDIE being set on the allocating thread. We can
> fix this, and make the code simpler and a bit more effective at the same time,
> by moving the part that determines ALLOC_NO_WATERMARKS from
> gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
> mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
> anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().

Your patch looks correct to me, but it makes me wonder about something.
Why do we need to mask out ALLOC_NO_WATERMARKS in several places? If
a requestor has the ALLOC_NO_WATERMARKS flag, it will eventually do an
ALLOC_NO_WATERMARKS allocation in the retry loop. I don't understand
the merit of masking it out.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath
  2016-05-10  7:35 ` [RFC 03/13] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
  2016-05-12 12:48   ` Michal Hocko
@ 2016-05-31  6:25   ` Joonsoo Kim
  2016-05-31 12:03     ` Vlastimil Babka
  1 sibling, 1 reply; 61+ messages in thread
From: Joonsoo Kim @ 2016-05-31  6:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue, May 10, 2016 at 09:35:53AM +0200, Vlastimil Babka wrote:
> After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
> first tries get_page_from_freelist() with the new alloc_flags, as it may
> succeed e.g. due to using min watermark instead of low watermark. This attempt
> does not have to be retried on each loop, since direct reclaim, direct
> compaction and oom call get_page_from_freelist() themselves.

Hmm... there is a corner case. If did_some_progress is 0 or compaction
is deferred, get_page_from_freelist() isn't called. But we can still
succeed in allocating memory, since kswapd reclaims memory in the
background.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
  2016-05-10 12:55   ` Vlastimil Babka
  2016-05-13 14:15   ` Michal Hocko
@ 2016-05-31  6:37   ` Joonsoo Kim
  2016-05-31 12:07     ` Vlastimil Babka
  2 siblings, 1 reply; 61+ messages in thread
From: Joonsoo Kim @ 2016-05-31  6:37 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue, May 10, 2016 at 09:36:02AM +0200, Vlastimil Babka wrote:
> During reclaim/compaction loop, compaction priority can be increased by the
> should_compact_retry() function, but the current code is not optimal for
> several reasons:
> 
> - priority is only increased when compaction_failed() is true, which means
>   that compaction has scanned the whole zone. This may not happen even after
>   multiple attempts with the lower priority due to parallel activity, so we
>   might needlessly struggle on the lower priority.
> 
> - should_compact_retry() is only called when should_reclaim_retry() returns
>   false. This means that compaction priority cannot get increased as long
>   as reclaim makes sufficient progress. Theoretically, reclaim should stop
>   retrying for high-order allocations as long as the high-order page doesn't
>   exist but due to races, this may result in spurious retries when the
>   high-order page momentarily does exist.
> 
> We can remove these corner cases by making sure that should_compact_retry() is
> always called, and increases compaction priority if possible. Examining further
> the compaction result can be done only after reaching the highest priority.
> This is a simple solution and we don't need to worry about reaching the highest
> priority "too soon" here - when should_compact_retry() is called it means that
> the system is already struggling and the allocation is supposed to either try
> as hard as possible, or it cannot fail at all. There's not much point staying
> at lower priorities with heuristics that may result in only partial compaction.
> 
> The only exception here is the COMPACT_SKIPPED result, which means that
> compaction could not run at all due to failing order-0 watermarks. In that
> case, don't increase compaction priority, and check if compaction could proceed
> when everything reclaimable was reclaimed. Before this patch, this was tied to
> compaction_withdrawn(), but the other results considered there are in fact only
> due to low compaction priority so we can ignore them thanks to the patch.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> ---
>  mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
>  1 file changed, 23 insertions(+), 23 deletions(-)
> 
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index aa9c39a7f40a..623027fb8121 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3248,28 +3248,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  		return false;
>  
>  	/*
> -	 * compaction considers all the zone as desperately out of memory
> -	 * so it doesn't really make much sense to retry except when the
> -	 * failure could be caused by insufficient priority
> +	 * Compaction backed off due to watermark checks for order-0
> +	 * so the regular reclaim has to try harder and reclaim something
> +	 * Retry only if it looks like reclaim might have a chance.
>  	 */
> -	if (compaction_failed(compact_result)) {
> -		if (*compact_priority > 0) {
> -			(*compact_priority)--;
> -			return true;
> -		}
> -		return false;
> -	}
> +	if (compact_result == COMPACT_SKIPPED)
> +		return compaction_zonelist_suitable(ac, order, alloc_flags);
>  
>  	/*
> -	 * make sure the compaction wasn't deferred or didn't bail out early
> -	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> +	 * Compaction could have withdrawn early or skip some zones or
> +	 * pageblocks. We were asked to retry, which means the allocation
> +	 * should try really hard, so increase the priority if possible.
>  	 */
> -	if (compaction_withdrawn(compact_result))
> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
> +	if (*compact_priority > 0) {
> +		(*compact_priority)--;
> +		return true;
> +	}
>  
>  	/*
> +	 * The remaining possibility is that compaction made progress and
> +	 * created a high-order page, but it was allocated by somebody else.
> +	 * To prevent thrashing, limit the number of retries in such case.
>  	 * !costly requests are much more important than __GFP_REPEAT
>  	 * costly ones because they are de facto nofail and invoke OOM
>  	 * killer to move on while costly can fail and users are ready
> @@ -3527,6 +3526,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  						struct alloc_context *ac)
>  {
>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
> +	bool should_retry;
>  	struct page *page = NULL;
>  	unsigned int alloc_flags;
>  	unsigned long did_some_progress;
> @@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>  	else
>  		no_progress_loops++;
>  
> -	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> -				 did_some_progress > 0, no_progress_loops))
> -		goto retry;
> -
> +	should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> +				 did_some_progress > 0, no_progress_loops);
>  	/*
>  	 * It doesn't make any sense to retry for the compaction if the order-0
>  	 * reclaim is not able to make any progress because the current
>  	 * implementation of the compaction depends on the sufficient amount
>  	 * of free memory (see __compaction_suitable)
>  	 */
> -	if (did_some_progress > 0 &&
> -			should_compact_retry(ac, order, alloc_flags,
> +	if (did_some_progress > 0)
> +		should_retry |= should_compact_retry(ac, order, alloc_flags,
>  				compact_result, &compact_priority,
> -				compaction_retries))
> +				compaction_retries);
> +	if (should_retry)
>  		goto retry;

Hmm... it looks odd that we check should_compact_retry() only when
did_some_progress > 0. If the system is full of anonymous memory and we
don't have swap, we can't reclaim anything, but we can still compact.
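
A minimal sketch of that alternative (purely illustrative, not part of the
posted patch) would be to drop the progress guard around the call shown in
the hunk above:

	/*
	 * Hypothetical variant: consult should_compact_retry() even when
	 * reclaim made no progress, so that an unreclaimable-but-compactable
	 * workload (e.g. anonymous memory with no swap) can still retry
	 * compaction.
	 */
	should_retry |= should_compact_retry(ac, order, alloc_flags,
			compact_result, &compact_priority,
			compaction_retries);
	if (should_retry)
		goto retry;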

And, your patchset makes me think that it would be better to completely
separate the retry loops for order-0 and high-order allocations.

The current code is a mix of these two types of criteria and is hard to
follow. Your patchset makes it simpler, but we could do better by
separating them completely. Any thoughts?

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-31  6:20       ` Joonsoo Kim
@ 2016-05-31  7:59         ` Vlastimil Babka
  2016-06-02  1:50           ` Joonsoo Kim
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-31  7:59 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Tetsuo Handa, mhocko, linux-mm, akpm, riel, rientjes, mgorman,
	hannes, linux-kernel, torvalds

On 05/31/2016 08:20 AM, Joonsoo Kim wrote:
>> From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
>> From: Vlastimil Babka <vbabka@suse.cz>
>> Date: Fri, 29 Apr 2016 11:51:17 +0200
>> Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
>>
>> In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
>> so move the initialization above the retry: label. Also make the comment above
>> the initialization more descriptive.
>>
>> The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS,
>> which may change due to TIF_MEMDIE being set on the allocating thread. We can
>> fix this, and make the code simpler and a bit more effective at the same time,
>> by moving the part that determines ALLOC_NO_WATERMARKS from
>> gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
>> mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
>> anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().
>
> Your patch looks correct to me but it makes me wonder something.
> Why do we need to mask out ALLOC_NO_WATERMARKS in several places? If
> some requestors have ALLOC_NO_WATERMARKS flag, he will
> eventually do ALLOC_NO_WATERMARKS allocation in retry loop. I don't
> understand what's the merit of masking out it.

I can think of a reason. If, e.g., reclaim brings free pages above the 
watermark in the 4th zone in the zonelist, we would like the subsequent 
get_page_from_freelist() to succeed in that 4th zone. Passing 
ALLOC_NO_WATERMARKS there would likely succeed in the first zone 
instead, needlessly going below its watermark.

But this actually makes no difference, since the ALLOC_NO_WATERMARKS 
attempt precedes reclaim/compaction attempts. It probably shouldn't...
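
For reference, the masking being discussed is in the freelist attempts done
right after direct reclaim/compaction; roughly (a sketch based on that era's
page_alloc.c, details elided):

	/*
	 * With watermarks still enforced, this retry falls through the
	 * zonelist until it reaches the zone where reclaim/compaction
	 * actually freed pages. Keeping ALLOC_NO_WATERMARKS here would
	 * likely satisfy the allocation from the first zone instead,
	 * dipping below its watermark.
	 */
	page = get_page_from_freelist(gfp_mask, order,
				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);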

> Thanks.
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 03/13] mm, page_alloc: don't retry initial attempt in slowpath
  2016-05-31  6:25   ` Joonsoo Kim
@ 2016-05-31 12:03     ` Vlastimil Babka
  0 siblings, 0 replies; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-31 12:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/31/2016 08:25 AM, Joonsoo Kim wrote:
> On Tue, May 10, 2016 at 09:35:53AM +0200, Vlastimil Babka wrote:
>> After __alloc_pages_slowpath() sets up new alloc_flags and wakes up kswapd, it
>> first tries get_page_from_freelist() with the new alloc_flags, as it may
>> succeed e.g. due to using min watermark instead of low watermark. This attempt
>> does not have to be retried on each loop, since direct reclaim, direct
>> compaction and oom call get_page_from_freelist() themselves.
>
> Hmm... there is a corner case. If did_some_progress is 0 or compaction
> is deferred, get_page_from_freelist() isn't called. But the allocation
> can still succeed, since kswapd reclaims memory in the background.

Hmm, good point. I think the cleanest solution is to let 
__alloc_pages_direct_reclaim() attempt the allocation regardless of 
did_some_progress.
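
Roughly, that would mean dropping the early return in there, so that the
freelist is still tried even when direct reclaim itself made no progress
(just a sketch against that era's code, drain-and-retry details elided):

	*did_some_progress = __perform_reclaim(gfp_mask, order, ac);

	/*
	 * Sketch: no "if (!*did_some_progress) return NULL;" here, so the
	 * freelist attempt below still runs - kswapd may have reclaimed
	 * memory in the background even if direct reclaim found nothing.
	 */
	page = get_page_from_freelist(gfp_mask, order,
				alloc_flags & ~ALLOC_NO_WATERMARKS, ac);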

> Thanks.
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-31  6:37   ` Joonsoo Kim
@ 2016-05-31 12:07     ` Vlastimil Babka
  2016-05-31 12:29       ` Vlastimil Babka
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-31 12:07 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/31/2016 08:37 AM, Joonsoo Kim wrote:
> On Tue, May 10, 2016 at 09:36:02AM +0200, Vlastimil Babka wrote:
>> During reclaim/compaction loop, compaction priority can be increased by the
>> should_compact_retry() function, but the current code is not optimal for
>> several reasons:
>>
>> - priority is only increased when compaction_failed() is true, which means
>>   that compaction has scanned the whole zone. This may not happen even after
>>   multiple attempts with the lower priority due to parallel activity, so we
>>   might needlessly struggle on the lower priority.
>>
>> - should_compact_retry() is only called when should_reclaim_retry() returns
>>   false. This means that compaction priority cannot get increased as long
>>   as reclaim makes sufficient progress. Theoretically, reclaim should stop
>>   retrying for high-order allocations as long as the high-order page doesn't
>>   exist but due to races, this may result in spurious retries when the
>>   high-order page momentarily does exist.
>>
>> We can remove these corner cases by making sure that should_compact_retry() is
>> always called, and increases compaction priority if possible. Examining further
>> the compaction result can be done only after reaching the highest priority.
>> This is a simple solution and we don't need to worry about reaching the highest
>> priority "too soon" here - when should_compact_retry() is called it means that
>> the system is already struggling and the allocation is supposed to either try
>> as hard as possible, or it cannot fail at all. There's not much point staying
>> at lower priorities with heuristics that may result in only partial compaction.
>>
>> The only exception here is the COMPACT_SKIPPED result, which means that
>> compaction could not run at all due to failing order-0 watermarks. In that
>> case, don't increase compaction priority, and check if compaction could proceed
>> when everything reclaimable was reclaimed. Before this patch, this was tied to
>> compaction_withdrawn(), but the other results considered there are in fact only
>> due to low compaction priority so we can ignore them thanks to the patch.
>>
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
>> ---
>>  mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
>>  1 file changed, 23 insertions(+), 23 deletions(-)
>>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index aa9c39a7f40a..623027fb8121 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -3248,28 +3248,27 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>>  		return false;
>>
>>  	/*
>> -	 * compaction considers all the zone as desperately out of memory
>> -	 * so it doesn't really make much sense to retry except when the
>> -	 * failure could be caused by insufficient priority
>> +	 * Compaction backed off due to watermark checks for order-0
>> +	 * so the regular reclaim has to try harder and reclaim something
>> +	 * Retry only if it looks like reclaim might have a chance.
>>  	 */
>> -	if (compaction_failed(compact_result)) {
>> -		if (*compact_priority > 0) {
>> -			(*compact_priority)--;
>> -			return true;
>> -		}
>> -		return false;
>> -	}
>> +	if (compact_result == COMPACT_SKIPPED)
>> +		return compaction_zonelist_suitable(ac, order, alloc_flags);
>>
>>  	/*
>> -	 * make sure the compaction wasn't deferred or didn't bail out early
>> -	 * due to locks contention before we declare that we should give up.
>> -	 * But do not retry if the given zonelist is not suitable for
>> -	 * compaction.
>> +	 * Compaction could have withdrawn early or skip some zones or
>> +	 * pageblocks. We were asked to retry, which means the allocation
>> +	 * should try really hard, so increase the priority if possible.
>>  	 */
>> -	if (compaction_withdrawn(compact_result))
>> -		return compaction_zonelist_suitable(ac, order, alloc_flags);
>> +	if (*compact_priority > 0) {
>> +		(*compact_priority)--;
>> +		return true;
>> +	}
>>
>>  	/*
>> +	 * The remaining possibility is that compaction made progress and
>> +	 * created a high-order page, but it was allocated by somebody else.
>> +	 * To prevent thrashing, limit the number of retries in such case.
>>  	 * !costly requests are much more important than __GFP_REPEAT
>>  	 * costly ones because they are de facto nofail and invoke OOM
>>  	 * killer to move on while costly can fail and users are ready
>> @@ -3527,6 +3526,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  						struct alloc_context *ac)
>>  {
>>  	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
>> +	bool should_retry;
>>  	struct page *page = NULL;
>>  	unsigned int alloc_flags;
>>  	unsigned long did_some_progress;
>> @@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>  	else
>>  		no_progress_loops++;
>>
>> -	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> -				 did_some_progress > 0, no_progress_loops))
>> -		goto retry;
>> -
>> +	should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>> +				 did_some_progress > 0, no_progress_loops);
>>  	/*
>>  	 * It doesn't make any sense to retry for the compaction if the order-0
>>  	 * reclaim is not able to make any progress because the current
>>  	 * implementation of the compaction depends on the sufficient amount
>>  	 * of free memory (see __compaction_suitable)
>>  	 */
>> -	if (did_some_progress > 0 &&
>> -			should_compact_retry(ac, order, alloc_flags,
>> +	if (did_some_progress > 0)
>> +		should_retry |= should_compact_retry(ac, order, alloc_flags,
>>  				compact_result, &compact_priority,
>> -				compaction_retries))
>> +				compaction_retries);
>> +	if (should_retry)
>>  		goto retry;
>
> Hmm... it looks odd that we check should_compact_retry() only when
> did_some_progress > 0. If the system is full of anonymous memory and we
> don't have swap, we can't reclaim anything, but we can still compact.

Right, thanks.

> And, your patchset makes me think that it would be better to completely
> separate the retry loops for order-0 and high-order allocations.

I don't know, the loop is already large enough. Basically, duplicating 
it sounds like a lot of bloat. Hiding the order-specific decisions in 
helpers sounds better.

> The current code is a mix of these two types of criteria and is hard to
> follow. Your patchset makes it simpler, but we could do better by
> separating them completely. Any thoughts?
>
> Thanks.
>

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-31 12:07     ` Vlastimil Babka
@ 2016-05-31 12:29       ` Vlastimil Babka
  2016-06-02  2:50         ` Joonsoo Kim
  0 siblings, 1 reply; 61+ messages in thread
From: Vlastimil Babka @ 2016-05-31 12:29 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On 05/31/2016 02:07 PM, Vlastimil Babka wrote:
> On 05/31/2016 08:37 AM, Joonsoo Kim wrote:
>>> @@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
>>>  	else
>>>  		no_progress_loops++;
>>>
>>> -	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>> -				 did_some_progress > 0, no_progress_loops))
>>> -		goto retry;
>>> -
>>> +	should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
>>> +				 did_some_progress > 0, no_progress_loops);
>>>  	/*
>>>  	 * It doesn't make any sense to retry for the compaction if the order-0
>>>  	 * reclaim is not able to make any progress because the current
>>>  	 * implementation of the compaction depends on the sufficient amount
>>>  	 * of free memory (see __compaction_suitable)
>>>  	 */
>>> -	if (did_some_progress > 0 &&
>>> -			should_compact_retry(ac, order, alloc_flags,
>>> +	if (did_some_progress > 0)
>>> +		should_retry |= should_compact_retry(ac, order, alloc_flags,
>>>  				compact_result, &compact_priority,
>>> -				compaction_retries))
>>> +				compaction_retries);
>>> +	if (should_retry)
>>>  		goto retry;
>>
>> Hmm... it looks odd that we check should_compact_retry() only when
>> did_some_progress > 0. If the system is full of anonymous memory and we
>> don't have swap, we can't reclaim anything, but we can still compact.
>
> Right, thanks.

Hmm, on the other hand, should_compact_retry() will assume (in 
compaction_zonelist_suitable()) that reclaimable memory is actually 
reclaimable. If we drop the reclaim progress requirement, there is 
nothing to tell us that it actually isn't. Doesn't that risk an 
infinite loop?
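
For context, a simplified illustration of that optimism (not the exact code,
and assuming the __compaction_suitable() variant that takes a watermark
target): per zone, every reclaimable page is counted as if reclaim could
actually free it before judging whether compaction could run, something like

	available = zone_page_state_snapshot(zone, NR_FREE_PAGES) +
		    zone_reclaimable_pages(zone);
	/* optimistic: assumes all of the above is actually attainable */
	if (__compaction_suitable(zone, order, alloc_flags,
				  classzone_idx, available) != COMPACT_SKIPPED)
		return true;

so without reclaim progress to contradict that assumption, the check can
keep telling us to retry.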

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
  2016-05-31  7:59         ` Vlastimil Babka
@ 2016-06-02  1:50           ` Joonsoo Kim
  0 siblings, 0 replies; 61+ messages in thread
From: Joonsoo Kim @ 2016-06-02  1:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Tetsuo Handa, mhocko, linux-mm, akpm, riel, rientjes, mgorman,
	hannes, linux-kernel, torvalds

On Tue, May 31, 2016 at 09:59:36AM +0200, Vlastimil Babka wrote:
> On 05/31/2016 08:20 AM, Joonsoo Kim wrote:
> >>From 68f09f1d4381c7451238b4575557580380d8bf30 Mon Sep 17 00:00:00 2001
> >>From: Vlastimil Babka <vbabka@suse.cz>
> >>Date: Fri, 29 Apr 2016 11:51:17 +0200
> >>Subject: [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath
> >>
> >>In __alloc_pages_slowpath(), alloc_flags doesn't change after it's initialized,
> >>so move the initialization above the retry: label. Also make the comment above
> >>the initialization more descriptive.
> >>
> >>The only exception in the alloc_flags being constant is ALLOC_NO_WATERMARKS,
> >>which may change due to TIF_MEMDIE being set on the allocating thread. We can
> >>fix this, and make the code simpler and a bit more effective at the same time,
> >>by moving the part that determines ALLOC_NO_WATERMARKS from
> >>gfp_to_alloc_flags() to gfp_pfmemalloc_allowed(). This means we don't have to
> >>mask out ALLOC_NO_WATERMARKS in several places in __alloc_pages_slowpath()
> >>anymore.  The only test for the flag can instead call gfp_pfmemalloc_allowed().
> >
> >Your patch looks correct to me but it makes me wonder something.
> >Why do we need to mask out ALLOC_NO_WATERMARKS in several places? If
> >some requestors have ALLOC_NO_WATERMARKS flag, he will
> >eventually do ALLOC_NO_WATERMARKS allocation in retry loop. I don't
> >understand what's the merit of masking out it.
> 
> I can think of a reason. If, e.g., reclaim brings free pages above the
> watermark in the 4th zone in the zonelist, we would like the
> subsequent get_page_from_freelist() to succeed in that 4th zone.
> Passing ALLOC_NO_WATERMARKS there would likely succeed in the first
> zone instead, needlessly going below its watermark.

Hmm... the intent is probably what you guess, but I don't think it's
correct. There would be little disadvantage even if we allocate the free
page from the 1st zone where the watermark is unmet, since the 1st zone
would be a higher zone than the 4th, and whoever can use the higher zone
can also use the lower one. There are many corner-case handlings, and we
can't be sure that we would eventually do the ALLOC_NO_WATERMARKS
allocation attempt again. And ALLOC_NO_WATERMARKS means that the task is
very important, so we need to make its allocation succeed as fast as
possible. I think that allowing ALLOC_NO_WATERMARKS as much as possible
is better.

So, IMO, keeping ALLOC_NO_WATERMARKS in alloc_flags and not masking it
out would be the way to go.

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

* Re: [RFC 12/13] mm, compaction: more reliably increase direct compaction priority
  2016-05-31 12:29       ` Vlastimil Babka
@ 2016-06-02  2:50         ` Joonsoo Kim
  0 siblings, 0 replies; 61+ messages in thread
From: Joonsoo Kim @ 2016-06-02  2:50 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Michal Hocko, linux-mm, Andrew Morton, Rik van Riel,
	David Rientjes, Mel Gorman, Johannes Weiner, Tetsuo Handa,
	linux-kernel, Linus Torvalds

On Tue, May 31, 2016 at 02:29:24PM +0200, Vlastimil Babka wrote:
> On 05/31/2016 02:07 PM, Vlastimil Babka wrote:
> >On 05/31/2016 08:37 AM, Joonsoo Kim wrote:
> >>>@@ -3695,22 +3695,22 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
> >>> 	else
> >>> 		no_progress_loops++;
> >>>
> >>>-	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>>-				 did_some_progress > 0, no_progress_loops))
> >>>-		goto retry;
> >>>-
> >>>+	should_retry = should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
> >>>+				 did_some_progress > 0, no_progress_loops);
> >>> 	/*
> >>> 	 * It doesn't make any sense to retry for the compaction if the order-0
> >>> 	 * reclaim is not able to make any progress because the current
> >>> 	 * implementation of the compaction depends on the sufficient amount
> >>> 	 * of free memory (see __compaction_suitable)
> >>> 	 */
> >>>-	if (did_some_progress > 0 &&
> >>>-			should_compact_retry(ac, order, alloc_flags,
> >>>+	if (did_some_progress > 0)
> >>>+		should_retry |= should_compact_retry(ac, order, alloc_flags,
> >>> 				compact_result, &compact_priority,
> >>>-				compaction_retries))
> >>>+				compaction_retries);
> >>>+	if (should_retry)
> >>> 		goto retry;
> >>
> >>Hmm... it looks odd that we check should_compact_retry() when
> >>did_some_progress > 0. If system is full of anonymous memory and we
> >>don't have swap, we can't reclaim anything but we can compact.
> >
> >Right, thanks.
> 
> Hmm, on the other hand, should_compact_retry() will assume (in
> compaction_zonelist_suitable()) that reclaimable memory is actually
> reclaimable. If we drop the reclaim progress requirement, there is
> nothing to tell us that it actually isn't. Doesn't that risk an
> infinite loop?

You are right. I hope this retry logic can be made robust enough to cover
all the theoretical situations, but it doesn't look easy. Sigh...

Thanks.

^ permalink raw reply	[flat|nested] 61+ messages in thread

end of thread, other threads:[~2016-06-02  2:49 UTC | newest]

Thread overview: 61+ messages
2016-05-10  7:35 [RFC 00/13] make direct compaction more deterministic Vlastimil Babka
2016-05-10  7:35 ` [RFC 01/13] mm, compaction: don't isolate PageWriteback pages in MIGRATE_SYNC_LIGHT mode Vlastimil Babka
2016-05-11 12:40   ` Michal Hocko
2016-05-10  7:35 ` [RFC 02/13] mm, page_alloc: set alloc_flags only once in slowpath Vlastimil Babka
2016-05-10 11:28   ` Tetsuo Handa
2016-05-10 12:30     ` Vlastimil Babka
2016-05-12 12:41       ` Michal Hocko
2016-05-31  6:20       ` Joonsoo Kim
2016-05-31  7:59         ` Vlastimil Babka
2016-06-02  1:50           ` Joonsoo Kim
2016-05-10  7:35 ` [RFC 03/13] mm, page_alloc: don't retry initial attempt " Vlastimil Babka
2016-05-12 12:48   ` Michal Hocko
2016-05-31  6:25   ` Joonsoo Kim
2016-05-31 12:03     ` Vlastimil Babka
2016-05-10  7:35 ` [RFC 04/13] mm, page_alloc: restructure direct compaction handling " Vlastimil Babka
2016-05-12 13:29   ` Michal Hocko
2016-05-13  8:10     ` Vlastimil Babka
2016-05-13  8:31       ` Michal Hocko
2016-05-10  7:35 ` [RFC 05/13] mm, page_alloc: make THP-specific decisions more generic Vlastimil Babka
2016-05-12 13:43   ` Michal Hocko
2016-05-10  7:35 ` [RFC 06/13] mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations Vlastimil Babka
2016-05-12 16:20   ` Michal Hocko
2016-05-13  8:23     ` Vlastimil Babka
2016-05-13 12:05       ` Michal Hocko
2016-05-18 11:59         ` Vlastimil Babka
2016-05-18 15:24           ` Michal Hocko
2016-05-20 13:57             ` Vlastimil Babka
2016-05-23  8:39               ` Michal Hocko
2016-05-10  7:35 ` [RFC 07/13] mm, compaction: introduce direct compaction priority Vlastimil Babka
2016-05-13 12:37   ` Michal Hocko
2016-05-10  7:35 ` [RFC 08/13] mm, compaction: simplify contended compaction handling Vlastimil Babka
2016-05-13 13:09   ` Michal Hocko
2016-05-16  7:10     ` Vlastimil Babka
2016-05-10  7:35 ` [RFC 09/13] mm, compaction: make whole_zone flag ignore cached scanner positions Vlastimil Babka
2016-05-13 13:23   ` Michal Hocko
2016-05-10  7:36 ` [RFC 10/13] mm, compaction: cleanup unused functions Vlastimil Babka
2016-05-10  7:36 ` [RFC 11/13] mm, compaction: add the ultimate direct compaction priority Vlastimil Babka
2016-05-13 13:38   ` Michal Hocko
2016-05-16  7:17     ` Vlastimil Babka
2016-05-16  8:11       ` Michal Hocko
2016-05-18 12:46       ` Vlastimil Babka
2016-05-10  7:36 ` [RFC 12/13] mm, compaction: more reliably increase " Vlastimil Babka
2016-05-10 12:55   ` Vlastimil Babka
2016-05-13 14:15   ` Michal Hocko
2016-05-16  7:31     ` Vlastimil Babka
2016-05-16  8:14       ` Michal Hocko
2016-05-16  9:27         ` Vlastimil Babka
2016-05-16  9:52           ` Michal Hocko
2016-05-31  6:37   ` Joonsoo Kim
2016-05-31 12:07     ` Vlastimil Babka
2016-05-31 12:29       ` Vlastimil Babka
2016-06-02  2:50         ` Joonsoo Kim
2016-05-10  7:36 ` [RFC 13/13] mm, compaction: fix and improve watermark handling Vlastimil Babka
2016-05-16  9:25   ` Michal Hocko
2016-05-16  9:50     ` Vlastimil Babka
2016-05-16 12:30       ` Michal Hocko
2016-05-18 13:50     ` Mel Gorman
2016-05-18 14:27       ` Michal Hocko
2016-05-18 14:40         ` Mel Gorman
2016-05-17 20:01 ` [RFC 00/13] make direct compaction more deterministic Michal Hocko
2016-05-18  7:19   ` Vlastimil Babka
