linux-mm.kvack.org archive mirror
* [RFC PATCH 0/3] fix hugetlb page allocation stalls
@ 2019-07-24 17:50 Mike Kravetz
  2019-07-24 17:50 ` [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-07-24 17:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Andrew Morton, Mike Kravetz

Allocation of hugetlb pages via the nr_hugepages sysctl or per-node sysfs
files can stall for minutes
or hours.  A simple example on a two node system with 8GB of memory is
as follows:

echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 > /proc/sys/vm/nr_hugepages

Obviously, both allocation attempts will fall short of their 8GB goal.
However, one or both of these commands may stall and not be interruptible.
The issues were discussed in this thread [1].

This series attempts to address the issues causing the stalls.  There are
two distinct issues, and an optimization.  For the reclaim and compaction
issues, suggestions were made to simply remove some existing code.  However,
the impact of such changes would be hard to address.  This series takes a
more conservative approach in an attempt to minimally impact existing
workloads.  The question of which approach is better is debatable, hence the
RFC designation.  Patches in the series address these issues:

1) Should_continue_reclaim returns true too often.
   Michal Hocko suggested removing the special casing for __GFP_RETRY_MAYFAIL
   in should_continue_reclaim.  This does indeed address the hugetlb
   allocations, but may impact other users.  Hillf Danton restructured
   the code in such a way to preserve much of the original semantics.  Hillf's
   patch also addresses hugetlb allocation issues and is included here.

2) With 1) addressed, should_compact_retry returns true too often.
   Mel Gorman suggested the removal of the compaction_zonelist_suitable() call.
   This routine/call was introduced by Michal Hocko for a specific use case.
   Therefore, removal would likely break that use case.  While examining the
   reasons for compaction_withdrawn() as in [2], it appears that there are
   several places where we should be using MIN_COMPACT_COSTLY_PRIORITY instead
   of MIN_COMPACT_PRIORITY for costly allocations (see the compact_priority
   sketch just after this list).  This patch makes those changes, which also
   result in more appropriate should_compact_retry behavior for hugetlb
   allocations.

3) This is simply an optimization of the allocation code for hugetlb pool
   pages.  After first __GFP_RETRY_MAYFAIL allocation failure on a node,
   it drops the __GFP_RETRY_MAYFAIL flag.
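
For reference, the relationship between the two priority limits mentioned
in 2) looks roughly like the sketch of enum compact_priority below
(paraphrased from include/linux/compaction.h; lower value means higher
priority, and costly orders are capped at MIN_COMPACT_COSTLY_PRIORITY):

enum compact_priority {
	COMPACT_PRIO_SYNC_FULL,
	MIN_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_FULL,
	COMPACT_PRIO_SYNC_LIGHT,
	MIN_COMPACT_COSTLY_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	DEF_COMPACT_PRIORITY = COMPACT_PRIO_SYNC_LIGHT,
	COMPACT_PRIO_ASYNC,
	INIT_COMPACT_PRIORITY = COMPACT_PRIO_ASYNC
};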


[1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
[2] http://lkml.kernel.org/r/6377c199-2b9e-e30d-a068-c304d8a3f706@oracle.com

Hillf Danton (1):
  mm, reclaim: make should_continue_reclaim perform dryrun detection

Mike Kravetz (2):
  mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly
    orders
  hugetlbfs: don't retry when pool page allocations start to fail

 mm/compaction.c | 18 +++++++---
 mm/hugetlb.c    | 87 +++++++++++++++++++++++++++++++++++++++++++------
 mm/vmscan.c     | 28 ++++++++--------
 3 files changed, 105 insertions(+), 28 deletions(-)

-- 
2.20.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-24 17:50 [RFC PATCH 0/3] fix hugetlb page allocation stalls Mike Kravetz
@ 2019-07-24 17:50 ` Mike Kravetz
  2019-07-25  8:05   ` Mel Gorman
  2019-07-24 17:50 ` [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders Mike Kravetz
  2019-07-24 17:50 ` [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
  2 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2019-07-24 17:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Andrew Morton, Mike Kravetz

From: Hillf Danton <hdanton@sina.com>

Address the issue of should_continue_reclaim returning true too often
for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
This could happen during hugetlb page allocation causing stalls for
minutes or hours.

Restructure code so that false will be returned in this case even if
there are plenty of inactive pages.

Need better description from Hillf Danton
Need SOB from Hillf Danton
---
 mm/vmscan.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f4fd02ae233e..484b6b1a954e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2673,18 +2673,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			return false;
 	}
 
-	/*
-	 * If we have not reclaimed enough pages for compaction and the
-	 * inactive lists are large enough, continue reclaiming
-	 */
-	pages_for_compaction = compact_gap(sc->order);
-	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
-	if (sc->nr_reclaimed < pages_for_compaction &&
-			inactive_lru_pages > pages_for_compaction)
-		return true;
-
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
@@ -2700,7 +2688,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			;
 		}
 	}
-	return true;
+
+	/*
+	 * If we have not reclaimed enough pages for compaction and the
+	 * inactive lists are large enough, continue reclaiming
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
+
+	return inactive_lru_pages > pages_for_compaction &&
+		/*
+		 * avoid dryrun with plenty of inactive pages
+		 */
+		nr_scanned && nr_reclaimed;
 }
 
 static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-07-24 17:50 [RFC PATCH 0/3] fix hugetlb page allocation stalls Mike Kravetz
  2019-07-24 17:50 ` [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
@ 2019-07-24 17:50 ` Mike Kravetz
  2019-07-25  8:06   ` Mel Gorman
  2019-07-31 12:06   ` Vlastimil Babka
  2019-07-24 17:50 ` [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
  2 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-07-24 17:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Andrew Morton, Mike Kravetz

For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
the minimum value (highest priority).  Other places in the compaction code key off
of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
costly order allocations.

This was observed when hugetlb allocations could stall for minutes or
hours when should_compact_retry() would return true more often than it
should.  Specifically, this was in the case where compact_result was
COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
made.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/compaction.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/mm/compaction.c b/mm/compaction.c
index 952dc2fb24e5..325b746068d1 100644
--- a/mm/compaction.c
+++ b/mm/compaction.c
@@ -2294,9 +2294,15 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
 		.alloc_flags = alloc_flags,
 		.classzone_idx = classzone_idx,
 		.direct_compaction = true,
-		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
-		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
-		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
+		.whole_zone = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
+				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
+				(prio == MIN_COMPACT_PRIORITY)),
+		.ignore_skip_hint = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
+				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
+				(prio == MIN_COMPACT_PRIORITY)),
+		.ignore_block_suitable = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
+				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
+				(prio == MIN_COMPACT_PRIORITY))
 	};
 	struct capture_control capc = {
 		.cc = &cc,
@@ -2338,6 +2344,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	int may_perform_io = gfp_mask & __GFP_IO;
 	struct zoneref *z;
 	struct zone *zone;
+	int min_priority;
 	enum compact_result rc = COMPACT_SKIPPED;
 
 	/*
@@ -2350,12 +2357,13 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
 	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
 
 	/* Compact each zone in the list */
+	min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
+			MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;
 	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
 								ac->nodemask) {
 		enum compact_result status;
 
-		if (prio > MIN_COMPACT_PRIORITY
-					&& compaction_deferred(zone, order)) {
+		if (prio > min_priority && compaction_deferred(zone, order)) {
 			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
 			continue;
 		}
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-24 17:50 [RFC PATCH 0/3] fix hugetlb page allocation stalls Mike Kravetz
  2019-07-24 17:50 ` [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
  2019-07-24 17:50 ` [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders Mike Kravetz
@ 2019-07-24 17:50 ` Mike Kravetz
  2019-07-25  8:13   ` Mel Gorman
  2 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2019-07-24 17:50 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Vlastimil Babka,
	Johannes Weiner, Andrew Morton, Mike Kravetz

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system.  If
nodes are not equal, it is quite possible for one node to fill up
before the others.  When this happens, the code still attempts to
allocate pages from the full node.  This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node.  If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts.  The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3c50344a9b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1405,12 +1405,27 @@ pgoff_t __basepage_index(struct page *page)
 }
 
 static struct page *alloc_buddy_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	int order = huge_page_order(h);
 	struct page *page;
+	bool alloc_try_hard = true;
 
-	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	/*
+	 * By default we always try hard to allocate the page with
+	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating pages in
+	 * a loop (to adjust global huge page counts) and previous allocation
+	 * failed, do not continue to try hard on the same node.  Use the
+	 * node_alloc_noretry bitmap to manage this state information.
+	 */
+	if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
+		alloc_try_hard = false;
+	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+	if (alloc_try_hard)
+		gfp_mask |= __GFP_RETRY_MAYFAIL;
+	else
+		gfp_mask |= __GFP_NORETRY;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
@@ -1419,6 +1434,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
+	/*
+	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
+	 * indicates an overall state change.  Clear bit so that we resume
+	 * normal 'try hard' allocations.
+	 */
+	if (node_alloc_noretry && page && !alloc_try_hard)
+		node_clear(nid, *node_alloc_noretry);
+
+	/*
+	 * If we tried hard to get a page but failed, set bit so that
+	 * subsequent attempts will not try as hard until there is an
+	 * overall state change.
+	 */
+	if (node_alloc_noretry && !page && alloc_try_hard)
+		node_set(nid, *node_alloc_noretry);
+
 	return page;
 }
 
@@ -1427,7 +1458,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
  * should use this function to get new hugetlb pages
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 
@@ -1435,7 +1467,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
 		page = alloc_buddy_huge_page(h, gfp_mask,
-				nid, nmask);
+				nid, nmask, node_alloc_noretry);
 	if (!page)
 		return NULL;
 
@@ -1450,14 +1482,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
  * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
  * manner.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+				nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 	int nr_nodes, node;
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
+		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed,
+						node_alloc_noretry);
 		if (page)
 			break;
 	}
@@ -1601,7 +1635,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -1637,7 +1671,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -2207,13 +2241,31 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t *node_alloc_noretry;
+
+	if (!hstate_is_gigantic(h)) {
+		/*
+		 * bit mask controlling how hard we retry per-node
+		 * allocations.
+		 */
+		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
+						GFP_KERNEL | __GFP_NORETRY);
+	} else {
+		/* allocations done at boot time */
+		node_alloc_noretry = NULL;
+	}
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY]))
+					 &node_states[N_MEMORY],
+					 node_alloc_noretry))
 			break;
 		cond_resched();
 	}
@@ -2225,6 +2277,9 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
+
+	if (node_alloc_noretry)
+		kfree(node_alloc_noretry);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -2323,6 +2378,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry,
+						GFP_KERNEL | __GFP_NORETRY);
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	spin_lock(&hugetlb_lock);
 
@@ -2356,6 +2417,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			if (node_alloc_noretry)
+				NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
 		/* Fall through to decrease pool */
@@ -2388,7 +2451,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed);
+		ret = alloc_pool_huge_page(h, nodes_allowed,
+						node_alloc_noretry);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -2429,6 +2493,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 
+	if (node_alloc_noretry)
+		NODEMASK_FREE(node_alloc_noretry);
+
 	return 0;
 }
 
-- 
2.20.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-24 17:50 ` [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
@ 2019-07-25  8:05   ` Mel Gorman
  2019-07-26  7:40     ` Hillf Danton
                       ` (2 more replies)
  0 siblings, 3 replies; 26+ messages in thread
From: Mel Gorman @ 2019-07-25  8:05 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

On Wed, Jul 24, 2019 at 10:50:12AM -0700, Mike Kravetz wrote:
> From: Hillf Danton <hdanton@sina.com>
> 
> Address the issue of should_continue_reclaim continuing true too often
> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
> This could happen during hugetlb page allocation causing stalls for
> minutes or hours.
> 
> Restructure code so that false will be returned in this case even if
> there are plenty of inactive pages.
> 
> Need better description from Hillf Danton
> Need SOB from Hillf Danton

Agreed that the description could do with improvement. However, it
makes sense that if compaction reports it can make progress, it is
unnecessary to continue reclaiming. There might be side-effects on
allocation latencies in other cases, but avoiding them would come at the
cost of potential premature reclaim, which has consequences of its own.
Overall I think it's an improvement so I'll ack it once it has a S-O-B.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-07-24 17:50 ` [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders Mike Kravetz
@ 2019-07-25  8:06   ` Mel Gorman
  2019-07-31 12:06   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2019-07-25  8:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

On Wed, Jul 24, 2019 at 10:50:13AM -0700, Mike Kravetz wrote:
> For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
> minimum (highest priority).  Other places in the compaction code key off
> of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
> MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
> costly order allocations.
> 
> This was observed when hugetlb allocations could stall for minutes or
> hours when should_compact_retry() would return true more often then it
> should.  Specifically, this was in the case where compact_result was
> COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
> made.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-24 17:50 ` [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
@ 2019-07-25  8:13   ` Mel Gorman
  2019-07-25 17:15     ` Mike Kravetz
  0 siblings, 1 reply; 26+ messages in thread
From: Mel Gorman @ 2019-07-25  8:13 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
> the pages will be interleaved between all nodes of the system.  If
> nodes are not equal, it is quite possible for one node to fill up
> before the others.  When this happens, the code still attempts to
> allocate pages from the full node.  This results in calls to direct
> reclaim and compaction which slow things down considerably.
> 
> When allocating pool pages, note the state of the previous allocation
> for each node.  If previous allocation failed, do not use the
> aggressive retry algorithm on successive attempts.  The allocation
> will still succeed if there is memory available, but it will not try
> as hard to free up memory.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

set_max_huge_pages can fail the NODEMASK_ALLOC() alloc, which you handle,
*but* in the event of an allocation failure this bug can silently recur.
An informational message might be justified in that case, so that a
recurring stall at least comes with a hint as to why. Technically passing NULL into
NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
handle freeing of a NULL pointer. However, that is cosmetic more than
anything. Whether you decide to change either or not;

Acked-by: Mel Gorman <mgorman@suse.de>

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-25  8:13   ` Mel Gorman
@ 2019-07-25 17:15     ` Mike Kravetz
  2019-07-25 22:43       ` Mel Gorman
  2019-07-31 13:23       ` Vlastimil Babka
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-07-25 17:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

On 7/25/19 1:13 AM, Mel Gorman wrote:
> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
>> the pages will be interleaved between all nodes of the system.  If
>> nodes are not equal, it is quite possible for one node to fill up
>> before the others.  When this happens, the code still attempts to
>> allocate pages from the full node.  This results in calls to direct
>> reclaim and compaction which slow things down considerably.
>>
>> When allocating pool pages, note the state of the previous allocation
>> for each node.  If previous allocation failed, do not use the
>> aggressive retry algorithm on successive attempts.  The allocation
>> will still succeed if there is memory available, but it will not try
>> as hard to free up memory.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
> *but* in the event of an allocation failure this bug can silently recur.
> An informational message might be justified in that case in case the
> stall should recur with no hint as to why.

Right.
Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
If we can't allocate a node mask, it is unlikely we will be able to allocate
a/any huge pages.  And, the system must be extremely low on memory and there
are likely other bigger issues.
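
Something along these lines (an untested sketch) at the top of
set_max_huge_pages() is what I have in mind:

	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry, GFP_KERNEL);

	/*
	 * Untested sketch: if we cannot even allocate a node mask, bail
	 * out early instead of silently reverting to the old retry
	 * behavior.  (With the on-stack NODEMASK_ALLOC variant this
	 * check never triggers.)
	 */
	if (!node_alloc_noretry)
		return -ENOMEM;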

There have been discussions elsewhere about discontinuing the use of
NODEMASK_ALLOC() and just putting the mask on the stack.  That may be
acceptable here as well.

>                                            Technically passing NULL into
> NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
> handle freeing of a NULL pointer. However, that is cosmetic more than
> anything. Whether you decide to change either or not;

Yes.
I will clean up with an updated series after more feedback.

> 
> Acked-by: Mel Gorman <mgorman@suse.de>
> 

Thanks!
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-25 17:15     ` Mike Kravetz
@ 2019-07-25 22:43       ` Mel Gorman
  2019-07-31 13:23       ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2019-07-25 22:43 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

On Thu, Jul 25, 2019 at 10:15:29AM -0700, Mike Kravetz wrote:
> On 7/25/19 1:13 AM, Mel Gorman wrote:
> > On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
> >> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
> >> the pages will be interleaved between all nodes of the system.  If
> >> nodes are not equal, it is quite possible for one node to fill up
> >> before the others.  When this happens, the code still attempts to
> >> allocate pages from the full node.  This results in calls to direct
> >> reclaim and compaction which slow things down considerably.
> >>
> >> When allocating pool pages, note the state of the previous allocation
> >> for each node.  If previous allocation failed, do not use the
> >> aggressive retry algorithm on successive attempts.  The allocation
> >> will still succeed if there is memory available, but it will not try
> >> as hard to free up memory.
> >>
> >> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> > 
> > set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
> > *but* in the event of an allocation failure this bug can silently recur.
> > An informational message might be justified in that case in case the
> > stall should recur with no hint as to why.
> 
> Right.
> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
> If we can't allocate a node mask, it is unlikely we will be able to allocate
> a/any huge pages.  And, the system must be extremely low on memory and there
> are likely other bigger issues.
> 

That might be better overall, you make a valid point that a failed
kmalloc is not a good sign for hugetlbfs allocations.

> There have been discussions elsewhere about discontinuing the use of
> NODEMASK_ALLOC() and just putting the mask on the stack.  That may be
> acceptable here as well.
> 

They can be big, and while this particular path would be relatively safe,
I think there will not be much functional difference between allocating
on the stack and a failed kmalloc in terms of hugetlbfs allocation
success rates.

> >                                            Technically passing NULL into
> > NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
> > handle freeing of a NULL pointer. However, that is cosmetic more than
> > anything. Whether you decide to change either or not;
> 
> Yes.
> I will clean up with an updated series after more feedback.
> 

Thanks.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-25  8:05   ` Mel Gorman
@ 2019-07-26  7:40     ` Hillf Danton
  2019-07-26  8:12     ` Mel Gorman
  2019-07-31 11:08     ` Vlastimil Babka
  2 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2019-07-26  7:40 UTC (permalink / raw)
  To: Mel Gorman, Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Vlastimil Babka,
	Johannes Weiner, Andrew Morton, Hillf Danton


On Thu, 25 Jul 2019 08:05:55 +0000 (UTC) Mel Gorman wrote:
> 
> Agreed that the description could do with improvement. However, it
> makes sense that if compaction reports it can make progress that it is
> unnecessary to continue reclaiming.

Thanks Mike and Mel.

Hillf
---8<---
From: Hillf Danton <hdanton@sina.com>
Subject: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection

Address the issue of should_continue_reclaim returning true too often
for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
This could happen during hugetlb page allocation causing stalls for
minutes or hours.

We can stop reclaiming pages if compaction reports it can make a progress.
A code reshuffle is needed to do that. And it has side-effects, however,
with allocation latencies in other cases but that would come at the cost
of potential premature reclaim which has consequences of itself.

We can also bail out of reclaiming pages if we know that there are not
enough inactive lru pages left to satisfy the costly allocation.

We can give up reclaiming pages too if we see dryrun occur, with the
certainty of plenty of inactive pages. IOW with dryrun detected, we are
sure we have reclaimed as many pages as we could.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hillf Danton <hdanton@sina.com>
---
 mm/vmscan.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index f4fd02a..484b6b1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2673,18 +2673,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			return false;
 	}
 
-	/*
-	 * If we have not reclaimed enough pages for compaction and the
-	 * inactive lists are large enough, continue reclaiming
-	 */
-	pages_for_compaction = compact_gap(sc->order);
-	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
-	if (sc->nr_reclaimed < pages_for_compaction &&
-			inactive_lru_pages > pages_for_compaction)
-		return true;
-
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
@@ -2700,7 +2688,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			;
 		}
 	}
-	return true;
+
+	/*
+	 * If we have not reclaimed enough pages for compaction and the
+	 * inactive lists are large enough, continue reclaiming
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
+
+	return inactive_lru_pages > pages_for_compaction &&
+		/*
+		 * avoid dryrun with plenty of inactive pages
+		 */
+		nr_scanned && nr_reclaimed;
 }
 
 static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
--


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-25  8:05   ` Mel Gorman
  2019-07-26  7:40     ` Hillf Danton
@ 2019-07-26  8:12     ` Mel Gorman
  2019-07-31 11:08     ` Vlastimil Babka
  2 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2019-07-26  8:12 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Mike Kravetz, linux-mm, linux-kernel, Michal Hocko,
	Vlastimil Babka, Johannes Weiner, Andrew Morton

> From: Hillf Danton <hdanton@sina.com>
> Subject: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
> 
> Address the issue of should_continue_reclaim continuing true too often
> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
> This could happen during hugetlb page allocation causing stalls for
> minutes or hours.
> 
> We can stop reclaiming pages if compaction reports it can make a progress.
> A code reshuffle is needed to do that. And it has side-effects, however,
> with allocation latencies in other cases but that would come at the cost
> of potential premature reclaim which has consequences of itself.
> 
> We can also bail out of reclaiming pages if we know that there are not
> enough inactive lru pages left to satisfy the costly allocation.
> 
> We can give up reclaiming pages too if we see dryrun occur, with the
> certainty of plenty of inactive pages. IOW with dryrun detected, we are
> sure we have reclaimed as many pages as we could.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hillf Danton <hdanton@sina.com>

Acked-by: Mel Gorman <mgorman@suse.de>

Thanks Hillf

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-25  8:05   ` Mel Gorman
  2019-07-26  7:40     ` Hillf Danton
  2019-07-26  8:12     ` Mel Gorman
@ 2019-07-31 11:08     ` Vlastimil Babka
  2019-07-31 12:25       ` Mel Gorman
                         ` (2 more replies)
  2 siblings, 3 replies; 26+ messages in thread
From: Vlastimil Babka @ 2019-07-31 11:08 UTC (permalink / raw)
  To: Hillf Danton, Mel Gorman, Mike Kravetz
  Cc: linux-mm, linux-kernel, Michal Hocko, Johannes Weiner, Andrew Morton

On 7/26/19 9:40 AM, Hillf Danton wrote:
> 
> On Thu, 25 Jul 2019 08:05:55 +0000 (UTC) Mel Gorman wrote:
>>
>> Agreed that the description could do with improvement. However, it
>> makes sense that if compaction reports it can make progress that it is
>> unnecessary to continue reclaiming.
> 
> Thanks Mike and Mel.
> 
> Hillf
> ---8<---
> From: Hillf Danton <hdanton@sina.com>
> Subject: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
> 
> Address the issue of should_continue_reclaim continuing true too often
> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
> This could happen during hugetlb page allocation causing stalls for
> minutes or hours.
> 
> We can stop reclaiming pages if compaction reports it can make a progress.
> A code reshuffle is needed to do that. And it has side-effects, however,
> with allocation latencies in other cases but that would come at the cost
> of potential premature reclaim which has consequences of itself.

I don't really understand that paragraph, did Mel mean it like this?

> We can also bail out of reclaiming pages if we know that there are not
> enough inactive lru pages left to satisfy the costly allocation.
> 
> We can give up reclaiming pages too if we see dryrun occur, with the
> certainty of plenty of inactive pages. IOW with dryrun detected, we are
> sure we have reclaimed as many pages as we could.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hillf Danton <hdanton@sina.com>

I agree this is an improvement overall, but perhaps the patch does too
many things at once. The reshuffle is one thing and makes sense. The
change of the last return condition could perhaps be separate. Also
AFAICS the ultimate result is that when nr_reclaimed == 0, the function
will now always return false. Which makes the initial test for
__GFP_RETRY_MAYFAIL and the comments there misleading. There will no
longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
yields no reclaimed page, we abort.
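
For context, the initial test and comments being referred to, at the top
of should_continue_reclaim(), look roughly like this (paraphrased):

	/* Consider stopping depending on scan and reclaim activity */
	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
		/*
		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming only
		 * once the full LRU list has been scanned and we are still
		 * failing to reclaim any pages.
		 */
		if (!nr_reclaimed && !nr_scanned)
			return false;
	} else {
		/*
		 * Allocations that can fail without consequence stop as
		 * soon as the last scanned chunk yields no reclaimed pages.
		 */
		if (!nr_reclaimed)
			return false;
	}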

> ---
>  mm/vmscan.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index f4fd02a..484b6b1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2673,18 +2673,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			return false;
>  	}
>  
> -	/*
> -	 * If we have not reclaimed enough pages for compaction and the
> -	 * inactive lists are large enough, continue reclaiming
> -	 */
> -	pages_for_compaction = compact_gap(sc->order);
> -	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> -	if (sc->nr_reclaimed < pages_for_compaction &&
> -			inactive_lru_pages > pages_for_compaction)
> -		return true;
> -
>  	/* If compaction would go ahead or the allocation would succeed, stop */
>  	for (z = 0; z <= sc->reclaim_idx; z++) {
>  		struct zone *zone = &pgdat->node_zones[z];
> @@ -2700,7 +2688,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			;
>  		}
>  	}
> -	return true;
> +
> +	/*
> +	 * If we have not reclaimed enough pages for compaction and the
> +	 * inactive lists are large enough, continue reclaiming
> +	 */
> +	pages_for_compaction = compact_gap(sc->order);
> +	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> +	if (get_nr_swap_pages() > 0)
> +		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +	return inactive_lru_pages > pages_for_compaction &&
> +		/*
> +		 * avoid dryrun with plenty of inactive pages
> +		 */
> +		nr_scanned && nr_reclaimed;
>  }
>  
>  static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> --
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-07-24 17:50 ` [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders Mike Kravetz
  2019-07-25  8:06   ` Mel Gorman
@ 2019-07-31 12:06   ` Vlastimil Babka
  2019-07-31 20:30     ` Mike Kravetz
  1 sibling, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2019-07-31 12:06 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner, Andrew Morton

On 7/24/19 7:50 PM, Mike Kravetz wrote:
> For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
> minimum (highest priority).  Other places in the compaction code key off
> of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
> MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
> costly order allocations.
> 
> This was observed when hugetlb allocations could stall for minutes or
> hours when should_compact_retry() would return true more often then it
> should.  Specifically, this was in the case where compact_result was
> COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
> made.

Hmm, the point of MIN_COMPACT_COSTLY_PRIORITY was that costly
allocations will not reach the priority where compaction becomes too
expensive. With your patch, they still don't reach that priority value,
but are allowed to be thorough anyway, even sooner. That just seems like
a wrong way to fix the problem. If should_compact_retry() returns
misleading results for costly allocations, then that should be fixed
instead?

Alternatively, you might want to say that hugetlb allocations are not
like other random costly allocations, because the admin setting
nr_hugepages is prepared to take the cost (I thought that was indicated
by the __GFP_RETRY_MAYFAIL flag, but seeing all the other users of it,
I'm not sure anymore). In that case should_compact_retry() could take
__GFP_RETRY_MAYFAIL into account and allow MIN_COMPACT_PRIORITY even for
costly allocations.
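
A rough, untested sketch of that alternative in should_compact_retry(),
assuming gfp_mask (or an equivalent hint) is made available there:

	/*
	 * Untested sketch: let __GFP_RETRY_MAYFAIL costly allocations use
	 * the full range of priorities, like non-costly allocations do.
	 */
	if (order > PAGE_ALLOC_COSTLY_ORDER &&
			!(gfp_mask & __GFP_RETRY_MAYFAIL))
		min_priority = MIN_COMPACT_COSTLY_PRIORITY;
	else
		min_priority = MIN_COMPACT_PRIORITY;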

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/compaction.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/compaction.c b/mm/compaction.c
> index 952dc2fb24e5..325b746068d1 100644
> --- a/mm/compaction.c
> +++ b/mm/compaction.c
> @@ -2294,9 +2294,15 @@ static enum compact_result compact_zone_order(struct zone *zone, int order,
>  		.alloc_flags = alloc_flags,
>  		.classzone_idx = classzone_idx,
>  		.direct_compaction = true,
> -		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> -		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> -		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> +		.whole_zone = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
> +				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
> +				(prio == MIN_COMPACT_PRIORITY)),
> +		.ignore_skip_hint = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
> +				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
> +				(prio == MIN_COMPACT_PRIORITY)),
> +		.ignore_block_suitable = ((order > PAGE_ALLOC_COSTLY_ORDER) ?
> +				(prio == MIN_COMPACT_COSTLY_PRIORITY) :
> +				(prio == MIN_COMPACT_PRIORITY))
>  	};
>  	struct capture_control capc = {
>  		.cc = &cc,
> @@ -2338,6 +2344,7 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  	int may_perform_io = gfp_mask & __GFP_IO;
>  	struct zoneref *z;
>  	struct zone *zone;
> +	int min_priority;
>  	enum compact_result rc = COMPACT_SKIPPED;
>  
>  	/*
> @@ -2350,12 +2357,13 @@ enum compact_result try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
>  	trace_mm_compaction_try_to_compact_pages(order, gfp_mask, prio);
>  
>  	/* Compact each zone in the list */
> +	min_priority = (order > PAGE_ALLOC_COSTLY_ORDER) ?
> +			MIN_COMPACT_COSTLY_PRIORITY : MIN_COMPACT_PRIORITY;
>  	for_each_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx,
>  								ac->nodemask) {
>  		enum compact_result status;
>  
> -		if (prio > MIN_COMPACT_PRIORITY
> -					&& compaction_deferred(zone, order)) {
> +		if (prio > min_priority && compaction_deferred(zone, order)) {
>  			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
>  			continue;
>  		}
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-31 11:08     ` Vlastimil Babka
@ 2019-07-31 12:25       ` Mel Gorman
  2019-07-31 21:11       ` Mike Kravetz
  2019-08-01  3:16       ` Hillf Danton
  2 siblings, 0 replies; 26+ messages in thread
From: Mel Gorman @ 2019-07-31 12:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Hillf Danton, Mike Kravetz, linux-mm, linux-kernel, Michal Hocko,
	Johannes Weiner, Andrew Morton

On Wed, Jul 31, 2019 at 01:08:44PM +0200, Vlastimil Babka wrote:
> On 7/26/19 9:40 AM, Hillf Danton wrote:
> > 
> > On Thu, 25 Jul 2019 08:05:55 +0000 (UTC) Mel Gorman wrote:
> >>
> >> Agreed that the description could do with improvement. However, it
> >> makes sense that if compaction reports it can make progress that it is
> >> unnecessary to continue reclaiming.
> > 
> > Thanks Mike and Mel.
> > 
> > Hillf
> > ---8<---
> > From: Hillf Danton <hdanton@sina.com>
> > Subject: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
> > 
> > Address the issue of should_continue_reclaim continuing true too often
> > for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
> > This could happen during hugetlb page allocation causing stalls for
> > minutes or hours.
> > 
> > We can stop reclaiming pages if compaction reports it can make a progress.
> > A code reshuffle is needed to do that. And it has side-effects, however,
> > with allocation latencies in other cases but that would come at the cost
> > of potential premature reclaim which has consequences of itself.
> 
> I don't really understand that paragraph, did Mel meant it like this?
> 

Fundamentally, the balancing act is between a) reclaiming more now so
that compaction is more likely to succeed, and b) keeping pages resident
to avoid refaulting.

With a) high order allocations are faster, less likely to stall and more
likely to succeed. However, it can also prematurely reclaim pages and free
more memory than is necessary for compaction to succeed in a reasonable
amount of time. We also know from testing that it can hit corner cases
with hugetlbfs where stalls happen for prolonged periods of time anyway
and the series overall is known to fix those stalls.

> > Cc: Vlastimil Babka <vbabka@suse.cz>
> > Cc: Johannes Weiner <hannes@cmpxchg.org>
> > Signed-off-by: Hillf Danton <hdanton@sina.com>
> 
> I agree this is an improvement overall, but perhaps the patch does too
> many things at once. The reshuffle is one thing and makes sense. The
> change of the last return condition could perhaps be separate. Also
> AFAICS the ultimate result is that when nr_reclaimed == 0, the function
> will now always return false. Which makes the initial test for
> __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
> longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
> yields no reclaimed page, we abort.
> 

I've no strong feelings on whether it is worth splitting the patch. In
my mind it's more or less doing one thing even though the one thing is a
relatively high-level problem.

-- 
Mel Gorman
SUSE Labs


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-25 17:15     ` Mike Kravetz
  2019-07-25 22:43       ` Mel Gorman
@ 2019-07-31 13:23       ` Vlastimil Babka
  2019-07-31 21:13         ` Mike Kravetz
  1 sibling, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2019-07-31 13:23 UTC (permalink / raw)
  To: Mike Kravetz, Mel Gorman
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Johannes Weiner, Andrew Morton

On 7/25/19 7:15 PM, Mike Kravetz wrote:
> On 7/25/19 1:13 AM, Mel Gorman wrote:
>> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>>> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
>>> the pages will be interleaved between all nodes of the system.  If
>>> nodes are not equal, it is quite possible for one node to fill up
>>> before the others.  When this happens, the code still attempts to
>>> allocate pages from the full node.  This results in calls to direct
>>> reclaim and compaction which slow things down considerably.
>>>
>>> When allocating pool pages, note the state of the previous allocation
>>> for each node.  If previous allocation failed, do not use the
>>> aggressive retry algorithm on successive attempts.  The allocation
>>> will still succeed if there is memory available, but it will not try
>>> as hard to free up memory.
>>>
>>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>>
>> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
>> *but* in the event of an allocation failure this bug can silently recur.
>> An informational message might be justified in that case in case the
>> stall should recur with no hint as to why.
> 
> Right.
> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
> If we can't allocate a node mask, it is unlikely we will be able to allocate
> a/any huge pages.  And, the system must be extremely low on memory and there
> are likely other bigger issues.

Agreed. But I would perhaps drop __GFP_NORETRY from the mask allocation
as that can fail for transient conditions.

> There have been discussions elsewhere about discontinuing the use of
> NODEMASK_ALLOC() and just putting the mask on the stack.  That may be
> acceptable here as well.
> 
>>                                            Technically passing NULL into
>> NODEMASK_FREE is also safe as kfree (if used for that kernel config) can
>> handle freeing of a NULL pointer. However, that is cosmetic more than
>> anything. Whether you decide to change either or not;
> 
> Yes.
> I will clean up with an updated series after more feedback.
> 
>>
>> Acked-by: Mel Gorman <mgorman@suse.de>
>>
> 
> Thanks!
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-07-31 12:06   ` Vlastimil Babka
@ 2019-07-31 20:30     ` Mike Kravetz
  2019-08-01 13:01       ` Vlastimil Babka
  0 siblings, 1 reply; 26+ messages in thread
From: Mike Kravetz @ 2019-07-31 20:30 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner, Andrew Morton

On 7/31/19 5:06 AM, Vlastimil Babka wrote:
> On 7/24/19 7:50 PM, Mike Kravetz wrote:
>> For PAGE_ALLOC_COSTLY_ORDER allocations, MIN_COMPACT_COSTLY_PRIORITY is
>> minimum (highest priority).  Other places in the compaction code key off
>> of MIN_COMPACT_PRIORITY.  Costly order allocations will never get to
>> MIN_COMPACT_PRIORITY.  Therefore, some conditions will never be met for
>> costly order allocations.
>>
>> This was observed when hugetlb allocations could stall for minutes or
>> hours when should_compact_retry() would return true more often then it
>> should.  Specifically, this was in the case where compact_result was
>> COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
>> made.
> 
> Hmm, the point of MIN_COMPACT_COSTLY_PRIORITY was that costly
> allocations will not reach the priority where compaction becomes too
> expensive. With your patch, they still don't reach that priority value,
> but are allowed to be thorough anyway, even sooner. That just seems like
> a wrong way to fix the problem.

Thanks Vlastimil, here is why I took the approach I did.

I instrumented some of the long stalls.  Here is one common example:
should_compact_retry returned true 5000000 consecutive times.  However,
the variable compaction_retries is zero.  We never get to the code that
increments the compaction_retries count because compaction_made_progress
is false and compaction_withdrawn is true.  As suggested earlier, I noted
why compaction_withdrawn is true.  Of the 5000000 calls,
4921875 were COMPACT_DEFERRED
78125 were COMPACT_PARTIAL_SKIPPED
Note that 5000000 / 64 (where 64 == 1 << COMPACT_MAX_DEFER_SHIFT) == 78125

I then started looking into why COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED
were being set/returned so often.
COMPACT_DEFERRED is set/returned in try_to_compact_pages.  Specifically,
		if (prio > MIN_COMPACT_PRIORITY
					&& compaction_deferred(zone, order)) {
			rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
			continue;
		}
COMPACT_PARTIAL_SKIPPED is set/returned in __compact_finished. Specifically,
	if (compact_scanners_met(cc)) {
		/* Let the next compaction start anew. */
		reset_cached_positions(cc->zone);

		/* ... */

		if (cc->direct_compaction)
			cc->zone->compact_blockskip_flush = true;

		if (cc->whole_zone)
			return COMPACT_COMPLETE;
		else
			return COMPACT_PARTIAL_SKIPPED;
	}

In both cases, compact_priority being MIN_COMPACT_COSTLY_PRIORITY and not
being able to go to MIN_COMPACT_PRIORITY caused the 'compaction_withdrawn'
result to be set/returned.

I do not know the subtleties of the compaction code, but it seems like
retrying in this manner does not make sense.

>                                 If should_compact_retry() returns
> misleading results for costly allocations, then that should be fixed
> instead?
> 
> Alternatively, you might want to say that hugetlb allocations are not
> like other random costly allocations, because the admin setting
> nr_hugepages is prepared to take the cost (I thought that was indicated
> by the __GFP_RETRY_MAYFAIL flag, but seeing all the other users of it,
> I'm not sure anymore).

The example above resulted in a stall of a little over 5 minutes.  However,
I have seen them last for hours.  Sure, the caller (admin for hugetlbfs)
knows there may be high costs.  But, I think minutes/hours to try and allocate
a single huge page is too much.  We should fail sooner than that.

>                        In that case should_compact_retry() could take
> __GFP_RETRY_MAYFAIL into account and allow MIN_COMPACT_PRIORITY even for
> costly allocations.

I'll put something like this together to test.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-31 11:08     ` Vlastimil Babka
  2019-07-31 12:25       ` Mel Gorman
@ 2019-07-31 21:11       ` Mike Kravetz
  2019-08-01  4:22         ` Hillf Danton
  2019-08-01  8:44         ` Vlastimil Babka
  2019-08-01  3:16       ` Hillf Danton
  2 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-07-31 21:11 UTC (permalink / raw)
  To: Vlastimil Babka, Hillf Danton, Mel Gorman
  Cc: linux-mm, linux-kernel, Michal Hocko, Johannes Weiner, Andrew Morton

On 7/31/19 4:08 AM, Vlastimil Babka wrote:
> 
> I agree this is an improvement overall, but perhaps the patch does too
> many things at once. The reshuffle is one thing and makes sense. The
> change of the last return condition could perhaps be separate. Also
> AFAICS the ultimate result is that when nr_reclaimed == 0, the function
> will now always return false. Which makes the initial test for
> __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
> longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
> yields no reclaimed page, we abort.

Can someone help me understand why nr_scanned == 0 guarantees a full
LRU scan?  FWICS, nr_scanned used in this context is only incremented
in shrink_page_list and potentially shrink_zones.  In the stall case I
am looking at, there are MANY cases in which nr_scanned is only a few
pages and none of those are reclaimed.

Can we not get nr_scanned == 0 on an arbitrary chunk of the LRU?

I must be missing something, because I do not see how nr_scanned == 0
guarantees a full scan.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-07-31 13:23       ` Vlastimil Babka
@ 2019-07-31 21:13         ` Mike Kravetz
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-07-31 21:13 UTC (permalink / raw)
  To: Vlastimil Babka, Mel Gorman
  Cc: linux-mm, linux-kernel, Hillf Danton, Michal Hocko,
	Johannes Weiner, Andrew Morton

On 7/31/19 6:23 AM, Vlastimil Babka wrote:
> On 7/25/19 7:15 PM, Mike Kravetz wrote:
>> On 7/25/19 1:13 AM, Mel Gorman wrote:
>>> On Wed, Jul 24, 2019 at 10:50:14AM -0700, Mike Kravetz wrote:
>>>
>>> set_max_huge_pages can fail the NODEMASK_ALLOC() alloc which you handle
>>> *but* in the event of an allocation failure this bug can silently recur.
>>> An informational message might be justified in that case in case the
>>> stall should recur with no hint as to why.
>>
>> Right.
>> Perhaps a NODEMASK_ALLOC() failure should just result in a quick exit/error.
>> If we can't allocate a node mask, it is unlikely we will be able to allocate
>> a/any huge pages.  And, the system must be extremely low on memory and there
>> are likely other bigger issues.
> 
> Agreed. But I would perhaps drop __GFP_NORETRY from the mask allocation
> as that can fail for transient conditions.

Thanks, I was unsure if adding __GFP_NORETRY would be a good idea.

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-31 11:08     ` Vlastimil Babka
  2019-07-31 12:25       ` Mel Gorman
  2019-07-31 21:11       ` Mike Kravetz
@ 2019-08-01  3:16       ` Hillf Danton
  2 siblings, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2019-08-01  3:16 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Mel Gorman, Mike Kravetz, linux-mm, linux-kernel, Michal Hocko,
	Johannes Weiner, Andrew Morton


On Wed, 31 Jul 2019 13:08:44 +0200 Vlastimil Babka wrote:
> 
> I agree this is an improvement overall, but perhaps the patch does too
> many things at once. The reshuffle is one thing and makes sense. The
> change of the last return condition could perhaps be separate. Also
> AFAICS the ultimate result is that when nr_reclaimed == 0, the function
> will now always return false. Which makes the initial test for
> __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
> longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
> yields no reclaimed page, we abort.
> 
Thanks Vlastimil.

In my opinion we can drop the test for __GFP_RETRY_MAYFAIL once we observe
no regression from the reshuffle combined with dryrun detection, for two
reasons:
1) a full lru scan is too expensive a price for kswapd to pay when
balancing a node.  The kswapd thread drops the costly order when
appropriate (see the sketch below), which makes the helper in question
return false anyway.

2) on the other hand, reclaimers have for a while now been operating
without taking either the costly order or __GFP_RETRY_MAYFAIL into
account; see commit 2bb0f34fe3c1 ("mm: vmscan: do not iterate all mem
cgroups for global direct reclaim") for instance.
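
The order drop mentioned in 1) is roughly this part of kswapd_shrink_node()
(paraphrased, not a verbatim copy):

	/*
	 * Fragmentation may mean the node cannot be rebalanced for
	 * high-order allocations.  Once enough has been reclaimed for
	 * the requested order, recheck watermarks at order-0 only, and
	 * let a high-order requester direct reclaim/compact on its own.
	 */
	if (sc->order && sc->nr_reclaimed >= compact_gap(sc->order))
		sc->order = 0;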

Hillf


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-31 21:11       ` Mike Kravetz
@ 2019-08-01  4:22         ` Hillf Danton
  2019-08-01  8:44         ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Hillf Danton @ 2019-08-01  4:22 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Vlastimil Babka, Mel Gorman, linux-mm, linux-kernel,
	Michal Hocko, Johannes Weiner, Andrew Morton


On Wed, 31 Jul 2019 14:11:02 -0700 Mike Kravetz wrote:
>
> On 7/31/19 4:08 AM, Vlastimil Babka wrote:
> > 
> > I agree this is an improvement overall, but perhaps the patch does too
> > many things at once. The reshuffle is one thing and makes sense. The
> > change of the last return condition could perhaps be separate. Also
> > AFAICS the ultimate result is that when nr_reclaimed == 0, the function
> > will now always return false. Which makes the initial test for
> > __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
> > longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
> > yields no reclaimed page, we abort.
> 
> Can someone help me understand why nr_scanned == 0 guarantees a full
> LRU scan?

AFAIC no pages are reclaimed without pages being scanned (no scanning, no
reclaiming).

A reclaimer's KPI is usually gauged by its efficiency at reclaiming pages,
irrespective of factors like the proportion of dirty pages, the costly
order, whether a swap device is online, and the lru size.

At the moment nr_scanned == 0 raises a red light and the reclaimer has to
take the right actions accordingly.  Currently nothing rules out a full
lru scan.

Feel free to correct me if anything missing.

Hillf


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-07-31 21:11       ` Mike Kravetz
  2019-08-01  4:22         ` Hillf Danton
@ 2019-08-01  8:44         ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2019-08-01  8:44 UTC (permalink / raw)
  To: Mike Kravetz, Hillf Danton, Mel Gorman
  Cc: linux-mm, linux-kernel, Michal Hocko, Johannes Weiner, Andrew Morton

On 7/31/19 11:11 PM, Mike Kravetz wrote:
> On 7/31/19 4:08 AM, Vlastimil Babka wrote:
>>
>> I agree this is an improvement overall, but perhaps the patch does too
>> many things at once. The reshuffle is one thing and makes sense. The
>> change of the last return condition could perhaps be separate. Also
>> AFAICS the ultimate result is that when nr_reclaimed == 0, the function
>> will now always return false. Which makes the initial test for
>> __GFP_RETRY_MAYFAIL and the comments there misleading. There will no
>> longer be a full LRU scan guaranteed - as long as the scanned LRU chunk
>> yields no reclaimed page, we abort.
> 
> Can someone help me understand why nr_scanned == 0 guarantees a full
> LRU scan?  FWICS, nr_scanned used in this context is only incremented
> in shrink_page_list and potentially shrink_zones.  In the stall case I
> am looking at, there are MANY cases in which nr_scanned is only a few
> pages and none of those are reclaimed.
> 
> Can we not get nr_scanned == 0 on an arbitrary chunk of the LRU?
> 
> I must be missing something, because I do not see how nr_scanned == 0
> guarantees a full scan.

Yeah, seems like it doesn't. More reasons to update/remove the comment.
Can be a followup cleanup if you don't want to block the series.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-07-31 20:30     ` Mike Kravetz
@ 2019-08-01 13:01       ` Vlastimil Babka
  2019-08-01 20:33         ` Mike Kravetz
  0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2019-08-01 13:01 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner, Andrew Morton

On 7/31/19 10:30 PM, Mike Kravetz wrote:
> On 7/31/19 5:06 AM, Vlastimil Babka wrote:
>> On 7/24/19 7:50 PM, Mike Kravetz wrote:
>>> For PAGE_ALLOC_COSTLY_ORDER allocations,
>>> MIN_COMPACT_COSTLY_PRIORITY is minimum (highest priority).  Other
>>> places in the compaction code key off of MIN_COMPACT_PRIORITY.
>>> Costly order allocations will never get to MIN_COMPACT_PRIORITY.
>>> Therefore, some conditions will never be met for costly order
>>> allocations.
>>> 
>>> This was observed when hugetlb allocations could stall for
>>> minutes or hours when should_compact_retry() would return true
>>> more often than it should.  Specifically, this was in the case
>>> where compact_result was COMPACT_DEFERRED and
>>> COMPACT_PARTIAL_SKIPPED and no progress was being made.
>> 
>> Hmm, the point of MIN_COMPACT_COSTLY_PRIORITY was that costly 
>> allocations will not reach the priority where compaction becomes
>> too expensive. With your patch, they still don't reach that
>> priority value, but are allowed to be thorough anyway, even sooner.
>> That just seems like a wrong way to fix the problem.
> 
> Thanks Vlastimil, here is why I took the approach I did.

Thanks for the explanation.

> I instrumented some of the long stalls.  Here is one common example:
> should_compact_retry returned true 5000000 consecutive times.  However,
> the variable compaction_retries is zero.  We never get to the code that
> increments the compaction_retries count because compaction_made_progress
> is false and compaction_withdrawn is true.  As suggested earlier, I noted
> why compaction_withdrawn is true.  Of the 5000000 calls,
>   4921875 were COMPACT_DEFERRED
>     78125 were COMPACT_PARTIAL_SKIPPED
> Note that 5000000 / 64 (64 == 1 << COMPACT_MAX_DEFER_SHIFT) == 78125.
> 
> I then started looking into why COMPACT_DEFERRED and
> COMPACT_PARTIAL_SKIPPED were being set/returned so often.
> COMPACT_DEFERRED is set/returned in try_to_compact_pages.  Specifically:
> 
> 	if (prio > MIN_COMPACT_PRIORITY &&
> 				compaction_deferred(zone, order)) {

Ah, so I see it now, this is indeed why you get so many
COMPACT_DEFERRED, as prio is always above MIN_COMPACT_PRIORITY.

> 		rc = max_t(enum compact_result, COMPACT_DEFERRED, rc);
> 		continue;
> 	}
> 
> COMPACT_PARTIAL_SKIPPED is set/returned in __compact_finished.
> Specifically:
> 
> 	if (compact_scanners_met(cc)) {
> 		/* Let the next compaction start anew. */
> 		reset_cached_positions(cc->zone);
> 
> 		/* ... */
> 
> 		if (cc->direct_compaction)
> 			cc->zone->compact_blockskip_flush = true;
> 
> 		if (cc->whole_zone)
> 			return COMPACT_COMPLETE;
> 		else
> 			return COMPACT_PARTIAL_SKIPPED;
> 	}
> 
> In both cases, compact_priority being MIN_COMPACT_COSTLY_PRIORITY and
> not being able to go to MIN_COMPACT_PRIORITY caused the
> 'compaction_withdrawn' result to be set/returned.

Hmm, looks like compaction_withdrawn() is just too blunt a test.  It mixes
up the result where the reaction should be more reclaim (COMPACT_SKIPPED)
with the results that depend on compaction priority (the rest), where we
should either increase the priority or fail.
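
Spelled out, compaction_withdrawn() currently amounts to the following
(paraphrasing include/linux/compaction.h, not a verbatim copy):

	static inline bool compaction_withdrawn(enum compact_result result)
	{
		return result == COMPACT_SKIPPED ||	  /* wants more order-0 reclaim */
		       result == COMPACT_DEFERRED ||	  /* priority-dependent */
		       result == COMPACT_CONTENDED ||	  /* priority-dependent */
		       result == COMPACT_PARTIAL_SKIPPED; /* priority-dependent */
	}

Only the COMPACT_SKIPPED case is really a 'reclaim more first' signal; for
the others we should raise the priority or give up.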

> I do not know the subtleties of the compaction code, but it seems
> like retrying in this manner does not make sense.

I agree it doesn't, if we can't go for MIN_COMPACT_PRIORITY.

>> If should_compact_retry() returns misleading results for costly
>> allocations, then that should be fixed instead?
>> 
>> Alternatively, you might want to say that hugetlb allocations are
>> not like other random costly allocations, because the admin
>> setting nr_hugepages is prepared to take the cost (I thought that
>> was indicated by the __GFP_RETRY_MAYFAIL flag, but seeing all the
>> other users of it, I'm not sure anymore).
> 
> The example above, resulted in a stall of a little over 5 minutes.
> However, I have seen them last for hours.  Sure, the caller (admin
> for hugetlbfs) knows there may be high costs.  But, I think
> minutes/hours to try and allocate a single huge page is too much.  We
> should fail sooner than that.

Sure. We should eliminate the pointless retries in any case; the question
is whether we allow going to MIN_COMPACT_PRIORITY beyond
MIN_COMPACT_COSTLY_PRIORITY.

>> In that case should_compact_retry() could take __GFP_RETRY_MAYFAIL
>> into account and allow MIN_COMPACT_PRIORITY even for costly
>> allocations.
> 
> I'll put something like this together to test.

Could you try testing the patch below instead? It should hopefully
eliminate the stalls. If it makes hugepage allocation give up too early,
we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
MIN_COMPACT_PRIORITY priority. Thanks!

----8<----
diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..b8bfe8d5d2e9 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -129,11 +129,7 @@ static inline bool compaction_failed(enum compact_result result)
 	return false;
 }
 
-/*
- * Compaction  has backed off for some reason. It might be throttling or
- * lock contention. Retrying is still worthwhile.
- */
-static inline bool compaction_withdrawn(enum compact_result result)
+static inline bool compaction_needs_reclaim(enum compact_result result)
 {
 	/*
 	 * Compaction backed off due to watermark checks for order-0
@@ -142,6 +138,15 @@ static inline bool compaction_withdrawn(enum compact_result result)
 	if (result == COMPACT_SKIPPED)
 		return true;
 
+	return false;
+}
+
+/*
+ * Compaction  has backed off for some reason. It might be throttling or
+ * lock contention. Retrying is still worthwhile.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
 	/*
 	 * If compaction is deferred for high-order allocations, it is
 	 * because sync compaction recently failed. If this is the case
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..3dfce1f79112 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3965,6 +3965,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	if (compaction_failed(compact_result))
 		goto check_priority;
 
+	if (compaction_needs_reclaim(compact_result)) {
+		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		goto out;
+	}
+
 	/*
 	 * make sure the compaction wasn't deferred or didn't bail out early
 	 * due to locks contention before we declare that we should give up.
@@ -3972,8 +3977,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	 * compaction.
 	 */
 	if (compaction_withdrawn(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
-		goto out;
+		goto check_priority;
 	}
 
 	/*


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-08-01 13:01       ` Vlastimil Babka
@ 2019-08-01 20:33         ` Mike Kravetz
  2019-08-02 10:20           ` Vlastimil Babka
  2019-08-02 12:05           ` Vlastimil Babka
  0 siblings, 2 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-08-01 20:33 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner, Andrew Morton

On 8/1/19 6:01 AM, Vlastimil Babka wrote:
> Could you try testing the patch below instead? It should hopefully
> eliminate the stalls. If it makes hugepage allocation give up too early,
> we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
> MIN_COMPACT_PRIORITY priority. Thanks!

Thanks.  This patch does eliminate the stalls I was seeing.

In my testing, there is little difference in how many hugetlb pages are
allocated.  It does not appear to be giving up/failing too early.  But,
this is only with __GFP_RETRY_MAYFAIL.  The real concern would be with THP
requests.  Any suggestions on how to test that?

-- 
Mike Kravetz

> ----8<----
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..b8bfe8d5d2e9 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -129,11 +129,7 @@ static inline bool compaction_failed(enum compact_result result)
>  	return false;
>  }
>  
> -/*
> - * Compaction  has backed off for some reason. It might be throttling or
> - * lock contention. Retrying is still worthwhile.
> - */
> -static inline bool compaction_withdrawn(enum compact_result result)
> +static inline bool compaction_needs_reclaim(enum compact_result result)
>  {
>  	/*
>  	 * Compaction backed off due to watermark checks for order-0
> @@ -142,6 +138,15 @@ static inline bool compaction_withdrawn(enum compact_result result)
>  	if (result == COMPACT_SKIPPED)
>  		return true;
>  
> +	return false;
> +}
> +
> +/*
> + * Compaction  has backed off for some reason. It might be throttling or
> + * lock contention. Retrying is still worthwhile.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
>  	/*
>  	 * If compaction is deferred for high-order allocations, it is
>  	 * because sync compaction recently failed. If this is the case
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 272c6de1bf4e..3dfce1f79112 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3965,6 +3965,11 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	if (compaction_failed(compact_result))
>  		goto check_priority;
>  
> +	if (compaction_needs_reclaim(compact_result)) {
> +		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> +		goto out;
> +	}
> +
>  	/*
>  	 * make sure the compaction wasn't deferred or didn't bail out early
>  	 * due to locks contention before we declare that we should give up.
> @@ -3972,8 +3977,7 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	 * compaction.
>  	 */
>  	if (compaction_withdrawn(compact_result)) {
> -		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> -		goto out;
> +		goto check_priority;
>  	}
>  
>  	/*
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-08-01 20:33         ` Mike Kravetz
@ 2019-08-02 10:20           ` Vlastimil Babka
  2019-08-02 12:05           ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2019-08-02 10:20 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner, Andrew Morton

On 8/1/19 10:33 PM, Mike Kravetz wrote:
> On 8/1/19 6:01 AM, Vlastimil Babka wrote:
>> Could you try testing the patch below instead? It should hopefully
>> eliminate the stalls. If it makes hugepage allocation give up too early,
>> we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
>> MIN_COMPACT_PRIORITY priority. Thanks!
> 
> Thanks.  This patch does eliminate the stalls I was seeing.

Great, thanks! I'll send a proper patch then.

> In my testing, there is little difference in how many hugetlb pages are
> allocated.  It does not appear to be giving up/failing too early.  But,
> this is only with __GFP_RETRY_MAYFAIL.  The real concern would be with THP
> requests.  Any suggestions on how to test that?

AFAICS the default THP defrag mode is unaffected, as GFP_TRANSHUGE_LIGHT
doesn't include __GFP_DIRECT_RECLAIM, so it never reaches this code.
Madvised THP allocations will be affected; those would best be tested the
same way Andrea and Mel tested in the __GFP_THISNODE debate.
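
For reference, the gfp masks involved are roughly the following (from
include/linux/gfp.h; the exact composition may differ a bit between kernel
versions):

	#define GFP_TRANSHUGE_LIGHT	((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
					  __GFP_NOMEMALLOC | __GFP_NOWARN) & \
					 ~__GFP_RECLAIM)
	#define GFP_TRANSHUGE		(GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)

Only allocations that end up with __GFP_DIRECT_RECLAIM set (e.g. madvised
THP faults with defrag enabled) can reach should_compact_retry().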


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-08-01 20:33         ` Mike Kravetz
  2019-08-02 10:20           ` Vlastimil Babka
@ 2019-08-02 12:05           ` Vlastimil Babka
  2019-08-02 17:44             ` Mike Kravetz
  1 sibling, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2019-08-02 12:05 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrew Morton, Andrea Arcangeli, David Rientjes


On 8/1/19 10:33 PM, Mike Kravetz wrote:
> On 8/1/19 6:01 AM, Vlastimil Babka wrote:
>> Could you try testing the patch below instead? It should hopefully
>> eliminate the stalls. If it makes hugepage allocation give up too early,
>> we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
>> MIN_COMPACT_PRIORITY priority. Thanks!
> 
> Thanks.  This patch does eliminate the stalls I was seeing.
> 
> In my testing, there is little difference in how many hugetlb pages are
> allocated.  It does not appear to be giving up/failing too early.  But,
>> this is only with __GFP_RETRY_MAYFAIL.  The real concern would be with THP
> requests.  Any suggestions on how to test that?

Here's the full patch; can you include it in your series?
As madvised THP allocations might be affected (hopefully just by stalling
less, not by failing too much), I am adding the relevant people to CC -
testing the scenarios you care about is appreciated.  Thanks.

----8<----
From 67d9db457f023434e8912a3ea571e545fb772a1b Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Fri, 2 Aug 2019 13:13:35 +0200
Subject: [PATCH] mm, compaction: raise compaction priority after it withdraws

Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours
when should_compact_retry() would return true more often than it should.
Specifically, this was in the case where compact_result was COMPACT_DEFERRED
and COMPACT_PARTIAL_SKIPPED and no progress was being made."

The problem is that the compaction_withdrawn() test in should_compact_retry()
includes compaction outcomes that are only possible on low compaction priority,
and results in a retry without increasing the priority. This may result in
further reclaim, and more incomplete compaction attempts.

With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.

The COMPACT_SKIPPED result doesn't really fit together with the other outcomes
in compaction_withdrawn(), as that result is caused by insufficient order-0
pages, not by low compaction priority. With this patch, it is moved to
a new compaction_needs_reclaim() function, and for that outcome we keep the
current logic of retrying if it looks like reclaim will be able to help.

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 include/linux/compaction.h | 22 +++++++++++++++++-----
 mm/page_alloc.c            | 16 ++++++++++++----
 2 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..4b898cdbdf05 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -129,11 +129,8 @@ static inline bool compaction_failed(enum compact_result result)
 	return false;
 }
 
-/*
- * Compaction  has backed off for some reason. It might be throttling or
- * lock contention. Retrying is still worthwhile.
- */
-static inline bool compaction_withdrawn(enum compact_result result)
+/* Compaction needs reclaim to be performed first, so it can continue. */
+static inline bool compaction_needs_reclaim(enum compact_result result)
 {
 	/*
 	 * Compaction backed off due to watermark checks for order-0
@@ -142,6 +139,16 @@ static inline bool compaction_withdrawn(enum compact_result result)
 	if (result == COMPACT_SKIPPED)
 		return true;
 
+	return false;
+}
+
+/*
+ * Compaction has backed off for some reason after doing some work or none
+ * at all. It might be throttling or lock contention. Retrying might be still
+ * worthwhile, but with a higher priority if allowed.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
 	/*
 	 * If compaction is deferred for high-order allocations, it is
 	 * because sync compaction recently failed. If this is the case
@@ -207,6 +214,11 @@ static inline bool compaction_failed(enum compact_result result)
 	return false;
 }
 
+static inline bool compaction_needs_reclaim(enum compact_result result)
+{
+	return false;
+}
+
 static inline bool compaction_withdrawn(enum compact_result result)
 {
 	return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 272c6de1bf4e..bd0f00f8cfa3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3965,15 +3965,23 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	if (compaction_failed(compact_result))
 		goto check_priority;
 
+	/*
+	 * compaction was skipped because there are not enough order-0 pages
+	 * to work with, so we retry only if it looks like reclaim can help.
+	 */
+	if (compaction_needs_reclaim(compact_result)) {
+		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		goto out;
+	}
+
 	/*
 	 * make sure the compaction wasn't deferred or didn't bail out early
 	 * due to locks contention before we declare that we should give up.
-	 * But do not retry if the given zonelist is not suitable for
-	 * compaction.
+	 * But the next retry should use a higher priority if allowed, so
+	 * we don't just keep bailing out endlessly.
 	 */
 	if (compaction_withdrawn(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
-		goto out;
+		goto check_priority;
 	}
 
 	/*
-- 
2.22.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders
  2019-08-02 12:05           ` Vlastimil Babka
@ 2019-08-02 17:44             ` Mike Kravetz
  0 siblings, 0 replies; 26+ messages in thread
From: Mike Kravetz @ 2019-08-02 17:44 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrew Morton, Andrea Arcangeli, David Rientjes

On 8/2/19 5:05 AM, Vlastimil Babka wrote:
> 
> On 8/1/19 10:33 PM, Mike Kravetz wrote:
>> On 8/1/19 6:01 AM, Vlastimil Babka wrote:
>>> Could you try testing the patch below instead? It should hopefully
>>> eliminate the stalls. If it makes hugepage allocation give up too early,
>>> we'll know we have to involve __GFP_RETRY_MAYFAIL in allowing the
>>> MIN_COMPACT_PRIORITY priority. Thanks!
>>
>> Thanks.  This patch does eliminate the stalls I was seeing.
>>
>> In my testing, there is little difference in how many hugetlb pages are
>> allocated.  It does not appear to be giving up/failing too early.  But,
>> this is only with __GFP_RETRY_MAYFAIL.  The real concern would be with THP
>> requests.  Any suggestions on how to test that?
> 
> Here's the full patch, can you include it in your series?

Yes.  Thank you!

-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-08-02 17:44 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-07-24 17:50 [RFC PATCH 0/3] fix hugetlb page allocation stalls Mike Kravetz
2019-07-24 17:50 ` [RFC PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
2019-07-25  8:05   ` Mel Gorman
2019-07-26  7:40     ` Hillf Danton
2019-07-26  8:12     ` Mel Gorman
2019-07-31 11:08     ` Vlastimil Babka
2019-07-31 12:25       ` Mel Gorman
2019-07-31 21:11       ` Mike Kravetz
2019-08-01  4:22         ` Hillf Danton
2019-08-01  8:44         ` Vlastimil Babka
2019-08-01  3:16       ` Hillf Danton
2019-07-24 17:50 ` [RFC PATCH 2/3] mm, compaction: use MIN_COMPACT_COSTLY_PRIORITY everywhere for costly orders Mike Kravetz
2019-07-25  8:06   ` Mel Gorman
2019-07-31 12:06   ` Vlastimil Babka
2019-07-31 20:30     ` Mike Kravetz
2019-08-01 13:01       ` Vlastimil Babka
2019-08-01 20:33         ` Mike Kravetz
2019-08-02 10:20           ` Vlastimil Babka
2019-08-02 12:05           ` Vlastimil Babka
2019-08-02 17:44             ` Mike Kravetz
2019-07-24 17:50 ` [RFC PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
2019-07-25  8:13   ` Mel Gorman
2019-07-25 17:15     ` Mike Kravetz
2019-07-25 22:43       ` Mel Gorman
2019-07-31 13:23       ` Vlastimil Babka
2019-07-31 21:13         ` Mike Kravetz

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).