linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/3] address hugetlb page allocation stalls
@ 2019-08-02 22:39 Mike Kravetz
  2019-08-02 22:39 ` [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Mike Kravetz @ 2019-08-02 22:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Vlastimil Babka, Michal Hocko, Mel Gorman,
	Johannes Weiner, Andrea Arcangeli, David Rientjes, Andrew Morton,
	Mike Kravetz

Allocation of hugetlb pages via sysctl or procfs can stall for minutes
or hours.  A simple example on a two node system with 8GB of memory is
as follows:

echo 4096 > /sys/devices/system/node/node1/hugepages/hugepages-2048kB/nr_hugepages
echo 4096 > /proc/sys/vm/nr_hugepages

Obviously, both allocation attempts will fall short of their 8GB goal.
However, one or both of these commands may stall and not be interruptible.
The issues were initially discussed in mail thread [1] and RFC code at [2].

This series addresses the issues causing the stalls.  There are two distinct
fixes, and an optimization.  The reclaim patch by Hillf and compaction patch
by Vlastimil address corner cases in their respective areas.  hugetlb page
allocation could stall due to either of these issues.  The hugetlb patch by
Mike is an optimization suggested during the debug and development process.

[1] http://lkml.kernel.org/r/d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com
[2] http://lkml.kernel.org/r/20190724175014.9935-1-mike.kravetz@oracle.com

Hillf Danton (1):
  mm, reclaim: make should_continue_reclaim perform dryrun detection

Mike Kravetz (1):
  hugetlbfs: don't retry when pool page allocations start to fail

Vlastimil Babka (1):
  mm, compaction: raise compaction priority after it withdrawns

 include/linux/compaction.h | 22 +++++++---
 mm/hugetlb.c               | 86 +++++++++++++++++++++++++++++++++-----
 mm/page_alloc.c            | 16 +++++--
 mm/vmscan.c                | 28 +++++++------
 4 files changed, 120 insertions(+), 32 deletions(-)

-- 
2.20.1



* [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-02 22:39 [PATCH 0/3] address hugetlb page allocation stalls Mike Kravetz
@ 2019-08-02 22:39 ` Mike Kravetz
  2019-08-05  8:42   ` Vlastimil Babka
  2019-08-02 22:39 ` [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns Mike Kravetz
  2019-08-02 22:39 ` [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Kravetz @ 2019-08-02 22:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Vlastimil Babka, Michal Hocko, Mel Gorman,
	Johannes Weiner, Andrea Arcangeli, David Rientjes, Andrew Morton,
	Mike Kravetz

From: Hillf Danton <hdanton@sina.com>

Address the issue of should_continue_reclaim continuing true too often
for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
This could happen during hugetlb page allocation causing stalls for
minutes or hours.

We can stop reclaiming pages if compaction reports it can make progress.
A code reshuffle is needed to do that. And it has side-effects, however,
with allocation latencies in other cases but that would come at the cost
of potential premature reclaim which has consequences of itself.

We can also bail out of reclaiming pages if we know that there are not
enough inactive lru pages left to satisfy the costly allocation.

We can give up reclaiming pages too if we see dryrun occur, with the
certainty of plenty of inactive pages. IOW with dryrun detected, we are
sure we have reclaimed as many pages as we could.

Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
Acked-by: Mel Gorman <mgorman@suse.de>
---
 mm/vmscan.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)
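
The final test described in the commit message above can be modeled in
userspace roughly as follows. This is only a sketch: model_continue_reclaim()
is a hypothetical name, its inputs are plain numbers rather than values read
from compact_gap() or node_page_state(), and it leaves out the earlier checks
in should_continue_reclaim() (in_reclaim_compaction(), the per-zone
compaction_suitable() loop).

```c
#include <stdbool.h>

/*
 * Userspace model (not kernel code) of the last test in
 * should_continue_reclaim() after this patch.  The caller supplies
 * every value; nothing is read from the running system.
 */
bool model_continue_reclaim(unsigned long inactive_lru_pages,
			    unsigned long pages_for_compaction,
			    unsigned long nr_scanned,
			    unsigned long nr_reclaimed)
{
	/* Too few inactive pages left to help compaction: stop. */
	if (inactive_lru_pages <= pages_for_compaction)
		return false;

	/* Dryrun: pages were scanned but nothing was reclaimed. */
	if (nr_scanned && !nr_reclaimed)
		return false;

	/* Continue only if this pass both scanned and reclaimed pages. */
	return nr_scanned && nr_reclaimed;
}
```

With plenty of inactive pages, a pass that scanned 32 pages and reclaimed 8
keeps going, while a pass that scanned 32 and reclaimed 0 stops.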

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 47aa2158cfac..a386c5351592 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2738,18 +2738,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			return false;
 	}
 
-	/*
-	 * If we have not reclaimed enough pages for compaction and the
-	 * inactive lists are large enough, continue reclaiming
-	 */
-	pages_for_compaction = compact_gap(sc->order);
-	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
-	if (sc->nr_reclaimed < pages_for_compaction &&
-			inactive_lru_pages > pages_for_compaction)
-		return true;
-
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
@@ -2765,7 +2753,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			;
 		}
 	}
-	return true;
+
+	/*
+	 * If we have not reclaimed enough pages for compaction and the
+	 * inactive lists are large enough, continue reclaiming
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
+
+	return inactive_lru_pages > pages_for_compaction &&
+		/*
+		 * avoid dryrun with plenty of inactive pages
+		 */
+		nr_scanned && nr_reclaimed;
 }
 
 static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
-- 
2.20.1



* [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns
  2019-08-02 22:39 [PATCH 0/3] address hugetlb page allocation stalls Mike Kravetz
  2019-08-02 22:39 ` [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
@ 2019-08-02 22:39 ` Mike Kravetz
  2019-08-05  9:14   ` Vlastimil Babka
  2019-08-02 22:39 ` [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Kravetz @ 2019-08-02 22:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Vlastimil Babka, Michal Hocko, Mel Gorman,
	Johannes Weiner, Andrea Arcangeli, David Rientjes, Andrew Morton,
	Mike Kravetz

From: Vlastimil Babka <vbabka@suse.cz>

Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours
when should_compact_retry() would return true more often than it should.
Specifically, this was in the case where compact_result was COMPACT_DEFERRED
and COMPACT_PARTIAL_SKIPPED and no progress was being made."

The problem is that the compaction_withdrawn() test in should_compact_retry()
includes compaction outcomes that are only possible on low compaction priority,
and results in a retry without increasing the priority. This may result in
further reclaim, and more incomplete compaction attempts.

With this patch, compaction priority is raised when possible, or
should_compact_retry() returns false.

The COMPACT_SKIPPED result doesn't really fit together with the other outcomes
in compaction_withdrawn(), as that's a result caused by insufficient order-0
pages, not due to low compaction priority. With this patch, it is moved to
a new compaction_needs_reclaim() function, and for that outcome we keep the
current logic of retrying if it looks like reclaim will be able to help.

Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 include/linux/compaction.h | 22 +++++++++++++++++-----
 mm/page_alloc.c            | 16 ++++++++++++----
 2 files changed, 29 insertions(+), 9 deletions(-)
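
The retry flow described in the commit message above can be sketched in
userspace as below. Everything here is a hypothetical stand-in: the M_* enum,
the model_*() helpers, and the assumption that the priority value counts down
toward min_priority as compaction gets more aggressive. The sketch ignores
the other branches of should_compact_retry() (compaction_failed(), the retry
counter).

```c
#include <stdbool.h>

/* Hypothetical subset of compaction outcomes, for this model only. */
enum model_result { M_SKIPPED, M_DEFERRED, M_PARTIAL_SKIPPED, M_OTHER };

/* Outcome caused by too few order-0 pages: reclaim has to run first. */
bool model_needs_reclaim(enum model_result r)
{
	return r == M_SKIPPED;
}

/* Outcomes that are only possible at low compaction priority. */
bool model_withdrawn(enum model_result r)
{
	return r == M_DEFERRED || r == M_PARTIAL_SKIPPED;
}

/*
 * Model of the decision after the patch.  Assumption of this sketch:
 * a smaller *priority value means more aggressive compaction.
 */
bool model_compact_retry(enum model_result r, bool reclaim_can_help,
			 int *priority, int min_priority)
{
	/* Keep the old logic: retry only if reclaim looks useful. */
	if (model_needs_reclaim(r))
		return reclaim_can_help;

	/* Retry with a raised priority, or give up if already at the limit. */
	if (model_withdrawn(r)) {
		if (*priority > min_priority) {
			(*priority)--;
			return true;
		}
		return false;
	}

	/* Other outcomes are outside the scope of this sketch. */
	return false;
}
```

Starting at priority 2 with a minimum of 0, a run of M_DEFERRED results
allows two retries and then returns false, so the loop is bounded instead of
retrying at the same priority forever.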

diff --git a/include/linux/compaction.h b/include/linux/compaction.h
index 9569e7c786d3..4b898cdbdf05 100644
--- a/include/linux/compaction.h
+++ b/include/linux/compaction.h
@@ -129,11 +129,8 @@ static inline bool compaction_failed(enum compact_result result)
 	return false;
 }
 
-/*
- * Compaction  has backed off for some reason. It might be throttling or
- * lock contention. Retrying is still worthwhile.
- */
-static inline bool compaction_withdrawn(enum compact_result result)
+/* Compaction needs reclaim to be performed first, so it can continue. */
+static inline bool compaction_needs_reclaim(enum compact_result result)
 {
 	/*
 	 * Compaction backed off due to watermark checks for order-0
@@ -142,6 +139,16 @@ static inline bool compaction_withdrawn(enum compact_result result)
 	if (result == COMPACT_SKIPPED)
 		return true;
 
+	return false;
+}
+
+/*
+ * Compaction has backed off for some reason after doing some work or none
+ * at all. It might be throttling or lock contention. Retrying might be still
+ * worthwhile, but with a higher priority if allowed.
+ */
+static inline bool compaction_withdrawn(enum compact_result result)
+{
 	/*
 	 * If compaction is deferred for high-order allocations, it is
 	 * because sync compaction recently failed. If this is the case
@@ -207,6 +214,11 @@ static inline bool compaction_failed(enum compact_result result)
 	return false;
 }
 
+static inline bool compaction_needs_reclaim(enum compact_result result)
+{
+	return false;
+}
+
 static inline bool compaction_withdrawn(enum compact_result result)
 {
 	return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d3bb601c461b..af29c05e23aa 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3965,15 +3965,23 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
 	if (compaction_failed(compact_result))
 		goto check_priority;
 
+	/*
+	 * compaction was skipped because there are not enough order-0 pages
+	 * to work with, so we retry only if it looks like reclaim can help.
+	 */
+	if (compaction_needs_reclaim(compact_result)) {
+		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
+		goto out;
+	}
+
 	/*
 	 * make sure the compaction wasn't deferred or didn't bail out early
 	 * due to locks contention before we declare that we should give up.
-	 * But do not retry if the given zonelist is not suitable for
-	 * compaction.
+	 * But the next retry should use a higher priority if allowed, so
+	 * we don't just keep bailing out endlessly.
 	 */
 	if (compaction_withdrawn(compact_result)) {
-		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
-		goto out;
+		goto check_priority;
 	}
 
 	/*
-- 
2.20.1



* [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-08-02 22:39 [PATCH 0/3] address hugetlb page allocation stalls Mike Kravetz
  2019-08-02 22:39 ` [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
  2019-08-02 22:39 ` [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns Mike Kravetz
@ 2019-08-02 22:39 ` Mike Kravetz
  2019-08-05  9:28   ` Vlastimil Babka
  2 siblings, 1 reply; 12+ messages in thread
From: Mike Kravetz @ 2019-08-02 22:39 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Hillf Danton, Vlastimil Babka, Michal Hocko, Mel Gorman,
	Johannes Weiner, Andrea Arcangeli, David Rientjes, Andrew Morton,
	Mike Kravetz

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system.  If
nodes are not equal, it is quite possible for one node to fill up
before the others.  When this happens, the code still attempts to
allocate pages from the full node.  This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node.  If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts.  The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 76 insertions(+), 10 deletions(-)
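
The per-node state tracking described above can be modeled in userspace as
follows. This is a sketch under stated assumptions: a plain unsigned long
stands in for nodemask_t, node ids are assumed to be smaller than the number
of bits in that type, and model_try_hard() / model_update() are hypothetical
names, not functions in mm/hugetlb.c.

```c
#include <stdbool.h>

/* Try hard on a node unless its bit is set in the noretry mask. */
bool model_try_hard(unsigned long noretry_mask, int nid)
{
	return !(noretry_mask & (1UL << nid));
}

/*
 * Return the new mask after one allocation attempt on node nid.
 * tried_hard: the attempt used the aggressive retry behaviour.
 * got_page:   the attempt returned a page.
 */
unsigned long model_update(unsigned long noretry_mask, int nid,
			   bool tried_hard, bool got_page)
{
	/* Success without trying hard: state changed, resume trying hard. */
	if (got_page && !tried_hard)
		noretry_mask &= ~(1UL << nid);

	/* Failure despite trying hard: stop trying hard on this node. */
	if (!got_page && tried_hard)
		noretry_mask |= 1UL << nid;

	return noretry_mask;
}
```

A hard failure on node 1 sets bit 1, so later attempts on node 1 skip the
aggressive retries while node 0 is unaffected; a later cheap success on
node 1 clears the bit again.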

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..c707207e208f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1405,12 +1405,25 @@ pgoff_t __basepage_index(struct page *page)
 }
 
 static struct page *alloc_buddy_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	int order = huge_page_order(h);
 	struct page *page;
+	bool alloc_try_hard = true;
 
-	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	/*
+	 * By default we always try hard to allocate the page with
+	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating pages in
+	 * a loop (to adjust global huge page counts) and previous allocation
+	 * failed, do not continue to try hard on the same node.  Use the
+	 * node_alloc_noretry bitmap to manage this state information.
+	 */
+	if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
+		alloc_try_hard = false;
+	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+	if (alloc_try_hard)
+		gfp_mask |= __GFP_RETRY_MAYFAIL;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
@@ -1419,6 +1432,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
+	/*
+	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
+	 * indicates an overall state change.  Clear bit so that we resume
+	 * normal 'try hard' allocations.
+	 */
+	if (node_alloc_noretry && page && !alloc_try_hard)
+		node_clear(nid, *node_alloc_noretry);
+
+	/*
+	 * If we tried hard to get a page but failed, set bit so that
+	 * subsequent attempts will not try as hard until there is an
+	 * overall state change.
+	 */
+	if (node_alloc_noretry && !page && alloc_try_hard)
+		node_set(nid, *node_alloc_noretry);
+
 	return page;
 }
 
@@ -1427,7 +1456,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
  * should use this function to get new hugetlb pages
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 
@@ -1435,7 +1465,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
 		page = alloc_buddy_huge_page(h, gfp_mask,
-				nid, nmask);
+				nid, nmask, node_alloc_noretry);
 	if (!page)
 		return NULL;
 
@@ -1450,14 +1480,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
  * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
  * manner.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+				nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 	int nr_nodes, node;
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
+		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed,
+						node_alloc_noretry);
 		if (page)
 			break;
 	}
@@ -1601,7 +1633,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -1637,7 +1669,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -2207,13 +2239,31 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t *node_alloc_noretry;
+
+	if (!hstate_is_gigantic(h)) {
+		/*
+		 * bit mask controlling how hard we retry per-node
+		 * allocations.
+		 */
+		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
+						GFP_KERNEL | __GFP_NORETRY);
+	} else {
+		/* allocations done at boot time */
+		node_alloc_noretry = NULL;
+	}
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY]))
+					 &node_states[N_MEMORY],
+					 node_alloc_noretry))
 			break;
 		cond_resched();
 	}
@@ -2225,6 +2275,8 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
+
+	kfree(node_alloc_noretry);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -2323,6 +2375,14 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry,
+						GFP_KERNEL | __GFP_NORETRY);
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
+	else
+		return -ENOMEM;
 
 	spin_lock(&hugetlb_lock);
 
@@ -2356,6 +2416,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			if (node_alloc_noretry)
+				NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
 		/* Fall through to decrease pool */
@@ -2388,7 +2450,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed);
+		ret = alloc_pool_huge_page(h, nodes_allowed,
+						node_alloc_noretry);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -2429,6 +2492,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 
+	if (node_alloc_noretry)
+		NODEMASK_FREE(node_alloc_noretry);
+
 	return 0;
 }
 
-- 
2.20.1



* Re: [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-02 22:39 ` [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
@ 2019-08-05  8:42   ` Vlastimil Babka
  2019-08-05 10:57     ` Vlastimil Babka
  2019-08-05 16:54     ` Mike Kravetz
  0 siblings, 2 replies; 12+ messages in thread
From: Vlastimil Babka @ 2019-08-05  8:42 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/3/19 12:39 AM, Mike Kravetz wrote:
> From: Hillf Danton <hdanton@sina.com>
> 
> Address the issue of should_continue_reclaim continuing true too often
> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
> This could happen during hugetlb page allocation causing stalls for
> minutes or hours.
> 
> We can stop reclaiming pages if compaction reports it can make progress.
> A code reshuffle is needed to do that.

> And it has side-effects, however,
> with allocation latencies in other cases but that would come at the cost
> of potential premature reclaim which has consequences of itself.

Based on Mel's longer explanation, can we clarify the wording here? e.g.:

There might be side-effect for other high-order allocations that would
potentially benefit from more reclaim before compaction for them to be
faster and less likely to stall, but the consequences of
premature/over-reclaim are considered worse.

> We can also bail out of reclaiming pages if we know that there are not
> enough inactive lru pages left to satisfy the costly allocation.
> 
> We can give up reclaiming pages too if we see dryrun occur, with the
> certainty of plenty of inactive pages. IOW with dryrun detected, we are
> sure we have reclaimed as many pages as we could.
> 
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Signed-off-by: Hillf Danton <hdanton@sina.com>
> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
> Acked-by: Mel Gorman <mgorman@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>
I will send some followup cleanup.

Should there also be Mike's SOB?



> ---
>  mm/vmscan.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 47aa2158cfac..a386c5351592 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2738,18 +2738,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			return false;
>  	}
>  
> -	/*
> -	 * If we have not reclaimed enough pages for compaction and the
> -	 * inactive lists are large enough, continue reclaiming
> -	 */
> -	pages_for_compaction = compact_gap(sc->order);
> -	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> -	if (sc->nr_reclaimed < pages_for_compaction &&
> -			inactive_lru_pages > pages_for_compaction)
> -		return true;
> -
>  	/* If compaction would go ahead or the allocation would succeed, stop */
>  	for (z = 0; z <= sc->reclaim_idx; z++) {
>  		struct zone *zone = &pgdat->node_zones[z];
> @@ -2765,7 +2753,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			;
>  		}
>  	}
> -	return true;
> +
> +	/*
> +	 * If we have not reclaimed enough pages for compaction and the
> +	 * inactive lists are large enough, continue reclaiming
> +	 */
> +	pages_for_compaction = compact_gap(sc->order);
> +	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> +	if (get_nr_swap_pages() > 0)
> +		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +	return inactive_lru_pages > pages_for_compaction &&
> +		/*
> +		 * avoid dryrun with plenty of inactive pages
> +		 */
> +		nr_scanned && nr_reclaimed;
>  }
>  
>  static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> 



* Re: [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns
  2019-08-02 22:39 ` [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns Mike Kravetz
@ 2019-08-05  9:14   ` Vlastimil Babka
  0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2019-08-05  9:14 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/3/19 12:39 AM, Mike Kravetz wrote:
> From: Vlastimil Babka <vbabka@suse.cz>
> 
> Mike Kravetz reports that "hugetlb allocations could stall for minutes or hours
> when should_compact_retry() would return true more often than it should.
> Specifically, this was in the case where compact_result was COMPACT_DEFERRED
> and COMPACT_PARTIAL_SKIPPED and no progress was being made."
> 
> The problem is that the compaction_withdrawn() test in should_compact_retry()
> includes compaction outcomes that are only possible on low compaction priority,
> and results in a retry without increasing the priority. This may result in
> further reclaim, and more incomplete compaction attempts.
> 
> With this patch, compaction priority is raised when possible, or
> should_compact_retry() returns false.
> 
> The COMPACT_SKIPPED result doesn't really fit together with the other outcomes
> in compaction_withdrawn(), as that's a result caused by insufficient order-0
> pages, not due to low compaction priority. With this patch, it is moved to
> a new compaction_needs_reclaim() function, and for that outcome we keep the
> current logic of retrying if it looks like reclaim will be able to help.
> 
> Reported-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>

There should also be your SOB, IIUC.

> ---
>  include/linux/compaction.h | 22 +++++++++++++++++-----
>  mm/page_alloc.c            | 16 ++++++++++++----
>  2 files changed, 29 insertions(+), 9 deletions(-)
> 
> diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> index 9569e7c786d3..4b898cdbdf05 100644
> --- a/include/linux/compaction.h
> +++ b/include/linux/compaction.h
> @@ -129,11 +129,8 @@ static inline bool compaction_failed(enum compact_result result)
>  	return false;
>  }
>  
> -/*
> - * Compaction  has backed off for some reason. It might be throttling or
> - * lock contention. Retrying is still worthwhile.
> - */
> -static inline bool compaction_withdrawn(enum compact_result result)
> +/* Compaction needs reclaim to be performed first, so it can continue. */
> +static inline bool compaction_needs_reclaim(enum compact_result result)
>  {
>  	/*
>  	 * Compaction backed off due to watermark checks for order-0
> @@ -142,6 +139,16 @@ static inline bool compaction_withdrawn(enum compact_result result)
>  	if (result == COMPACT_SKIPPED)
>  		return true;
>  
> +	return false;
> +}
> +
> +/*
> + * Compaction has backed off for some reason after doing some work or none
> + * at all. It might be throttling or lock contention. Retrying might be still
> + * worthwhile, but with a higher priority if allowed.
> + */
> +static inline bool compaction_withdrawn(enum compact_result result)
> +{
>  	/*
>  	 * If compaction is deferred for high-order allocations, it is
>  	 * because sync compaction recently failed. If this is the case
> @@ -207,6 +214,11 @@ static inline bool compaction_failed(enum compact_result result)
>  	return false;
>  }
>  
> +static inline bool compaction_needs_reclaim(enum compact_result result)
> +{
> +	return false;
> +}
> +
>  static inline bool compaction_withdrawn(enum compact_result result)
>  {
>  	return true;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d3bb601c461b..af29c05e23aa 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3965,15 +3965,23 @@ should_compact_retry(struct alloc_context *ac, int order, int alloc_flags,
>  	if (compaction_failed(compact_result))
>  		goto check_priority;
>  
> +	/*
> +	 * compaction was skipped because there are not enough order-0 pages
> +	 * to work with, so we retry only if it looks like reclaim can help.
> +	 */
> +	if (compaction_needs_reclaim(compact_result)) {
> +		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> +		goto out;
> +	}
> +
>  	/*
>  	 * make sure the compaction wasn't deferred or didn't bail out early
>  	 * due to locks contention before we declare that we should give up.
> -	 * But do not retry if the given zonelist is not suitable for
> -	 * compaction.
> +	 * But the next retry should use a higher priority if allowed, so
> +	 * we don't just keep bailing out endlessly.
>  	 */
>  	if (compaction_withdrawn(compact_result)) {
> -		ret = compaction_zonelist_suitable(ac, order, alloc_flags);
> -		goto out;
> +		goto check_priority;
>  	}
>  
>  	/*
> 



* Re: [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-08-02 22:39 ` [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
@ 2019-08-05  9:28   ` Vlastimil Babka
  2019-08-05 17:12     ` Mike Kravetz
  0 siblings, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2019-08-05  9:28 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/3/19 12:39 AM, Mike Kravetz wrote:
> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
> the pages will be interleaved between all nodes of the system.  If
> nodes are not equal, it is quite possible for one node to fill up
> before the others.  When this happens, the code still attempts to
> allocate pages from the full node.  This results in calls to direct
> reclaim and compaction which slow things down considerably.
> 
> When allocating pool pages, note the state of the previous allocation
> for each node.  If previous allocation failed, do not use the
> aggressive retry algorithm on successive attempts.  The allocation
> will still succeed if there is memory available, but it will not try
> as hard to free up memory.
> 
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Looks like only part of the (agreed with) suggestions were implemented?
- set_max_huge_pages() returns -ENOMEM if nodemask can't be allocated,
but hugetlb_hstate_alloc_pages() doesn't.
- there's still __GFP_NORETRY in nodemask allocations
- (cosmetics) Mel pointed out that NODEMASK_FREE() works fine with NULL
pointers

Thanks,
Vlastimil


* Re: [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-05  8:42   ` Vlastimil Babka
@ 2019-08-05 10:57     ` Vlastimil Babka
  2019-08-05 16:58       ` Mike Kravetz
  2019-08-05 16:54     ` Mike Kravetz
  1 sibling, 1 reply; 12+ messages in thread
From: Vlastimil Babka @ 2019-08-05 10:57 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/5/19 10:42 AM, Vlastimil Babka wrote:
> On 8/3/19 12:39 AM, Mike Kravetz wrote:
>> From: Hillf Danton <hdanton@sina.com>
>>
>> Address the issue of should_continue_reclaim continuing true too often
>> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
>> This could happen during hugetlb page allocation causing stalls for
>> minutes or hours.
>>
>> We can stop reclaiming pages if compaction reports it can make progress.
>> A code reshuffle is needed to do that.
> 
>> And it has side-effects, however,
>> with allocation latencies in other cases but that would come at the cost
>> of potential premature reclaim which has consequences of itself.
> 
> Based on Mel's longer explanation, can we clarify the wording here? e.g.:
> 
> There might be side-effect for other high-order allocations that would
> potentially benefit from more reclaim before compaction for them to be
> faster and less likely to stall, but the consequences of
> premature/over-reclaim are considered worse.
> 
>> We can also bail out of reclaiming pages if we know that there are not
>> enough inactive lru pages left to satisfy the costly allocation.
>>
>> We can give up reclaiming pages too if we see dryrun occur, with the
>> certainty of plenty of inactive pages. IOW with dryrun detected, we are
>> sure we have reclaimed as many pages as we could.
>>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Hillf Danton <hdanton@sina.com>
>> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> I will send some followup cleanup.

How about this?
----8<----
From 0040b32462587171ad22395a56699cc036ad483f Mon Sep 17 00:00:00 2001
From: Vlastimil Babka <vbabka@suse.cz>
Date: Mon, 5 Aug 2019 12:49:40 +0200
Subject: [PATCH] mm, reclaim: cleanup should_continue_reclaim()

After commit "mm, reclaim: make should_continue_reclaim perform dryrun
detection", a closer look at the function shows that nr_reclaimed == 0 means
the function will always return false. And since non-zero nr_reclaimed implies
non-zero nr_scanned, testing nr_scanned serves no purpose, and neither does
testing for __GFP_RETRY_MAYFAIL.

This patch thus cleans up the function to test only !nr_reclaimed upfront, and
removes the __GFP_RETRY_MAYFAIL test and the nr_scanned parameter completely.
The comment is also updated, explaining that approximating "full LRU list has
been scanned" with nr_scanned == 0 didn't really work.

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
---
 mm/vmscan.c | 43 ++++++++++++++-----------------------------
 1 file changed, 14 insertions(+), 29 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index ad498b76e492..db3c9e06a888 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2582,7 +2582,6 @@ static bool in_reclaim_compaction(struct scan_control *sc)
  */
 static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 					unsigned long nr_reclaimed,
-					unsigned long nr_scanned,
 					struct scan_control *sc)
 {
 	unsigned long pages_for_compaction;
@@ -2593,28 +2592,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	if (!in_reclaim_compaction(sc))
 		return false;
 
-	/* Consider stopping depending on scan and reclaim activity */
-	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
-		/*
-		 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
-		 * full LRU list has been scanned and we are still failing
-		 * to reclaim pages. This full LRU scan is potentially
-		 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
-		 */
-		if (!nr_reclaimed && !nr_scanned)
-			return false;
-	} else {
-		/*
-		 * For non-__GFP_RETRY_MAYFAIL allocations which can presumably
-		 * fail without consequence, stop if we failed to reclaim
-		 * any pages from the last SWAP_CLUSTER_MAX number of
-		 * pages that were scanned. This will return to the
-		 * caller faster at the risk reclaim/compaction and
-		 * the resulting allocation attempt fails
-		 */
-		if (!nr_reclaimed)
-			return false;
-	}
+	/*
+	 * Stop if we failed to reclaim any pages from the last SWAP_CLUSTER_MAX
+	 * number of pages that were scanned. This returns to the caller
+	 * sooner, at the risk that reclaim/compaction and the resulting
+	 * allocation attempt fail. In the past we tried harder for
+	 * __GFP_RETRY_MAYFAIL allocations by requiring that the full LRU list
+	 * be scanned first, assuming that a zero delta of sc->nr_scanned meant
+	 * a full LRU scan, but that approximation was wrong, and there were
+	 * corner cases where a non-zero number of pages was always scanned.
+	 */
+	if (!nr_reclaimed)
+		return false;
 
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
@@ -2641,11 +2630,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 	if (get_nr_swap_pages() > 0)
 		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
 
-	return inactive_lru_pages > pages_for_compaction &&
-		/*
-		 * avoid dryrun with plenty of inactive pages
-		 */
-		nr_scanned && nr_reclaimed;
+	return inactive_lru_pages > pages_for_compaction;
 }
 
 static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
@@ -2810,7 +2795,7 @@ static bool shrink_node(pg_data_t *pgdat, struct scan_control *sc)
 			wait_iff_congested(BLK_RW_ASYNC, HZ/10);
 
 	} while (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
-					 sc->nr_scanned - nr_scanned, sc));
+					 sc));
 
 	/*
 	 * Kswapd gives up on balancing particular nodes after too
-- 
2.22.0
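
The equivalence argued in the commit message above can be checked with a small userspace model. This is only a sketch, not kernel code: struct reclaim_model and every function name below are hypothetical, the per-zone compaction check is collapsed into a single boolean, and the model assumes that pages cannot be reclaimed without being scanned (nr_reclaimed <= nr_scanned).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for the inputs the kernel function sees. */
struct reclaim_model {
	bool retry_mayfail;     /* models sc->gfp_mask & __GFP_RETRY_MAYFAIL */
	unsigned long nr_reclaimed;
	unsigned long nr_scanned;
	bool compaction_ready;  /* models the per-zone "stop" check */
	bool enough_inactive;   /* models inactive_lru_pages > pages_for_compaction */
};

/* Decision logic before the cleanup (with the dryrun detection applied). */
bool old_should_continue(const struct reclaim_model *m)
{
	if (m->retry_mayfail) {
		if (!m->nr_reclaimed && !m->nr_scanned)
			return false;
	} else {
		if (!m->nr_reclaimed)
			return false;
	}
	if (m->compaction_ready)
		return false;
	return m->enough_inactive && m->nr_scanned && m->nr_reclaimed;
}

/* Decision logic after the cleanup: nr_scanned and the flag are unused. */
bool new_should_continue(const struct reclaim_model *m)
{
	if (!m->nr_reclaimed)
		return false;
	if (m->compaction_ready)
		return false;
	return m->enough_inactive;
}

/* Count the states in which the two versions disagree. */
unsigned int count_mismatches(void)
{
	static const unsigned long counts[] = { 0, 1, 32 };
	const size_t n = sizeof(counts) / sizeof(counts[0]);
	unsigned int mismatches = 0;

	for (int flags = 0; flags < 8; flags++) {
		for (size_t r = 0; r < n; r++) {
			for (size_t s = 0; s < n; s++) {
				struct reclaim_model m = {
					.retry_mayfail = flags & 1,
					.compaction_ready = flags & 2,
					.enough_inactive = flags & 4,
					.nr_reclaimed = counts[r],
					.nr_scanned = counts[s],
				};

				/* Assumption: reclaimed pages were scanned first. */
				if (m.nr_reclaimed > m.nr_scanned)
					continue;
				if (old_should_continue(&m) != new_should_continue(&m))
					mismatches++;
			}
		}
	}
	return mismatches;
}

/* Aborts if the two versions ever disagree under the assumption above. */
void check_cleanup_equivalence(void)
{
	assert(count_mismatches() == 0);
}
```

Calling check_cleanup_equivalence() from a test harness exercises every modelled state. The two versions differ only if the assumption is dropped, for example with nr_reclaimed = 1 and nr_scanned = 0, which is why the commit message spells out that non-zero nr_reclaimed implies non-zero nr_scanned.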




* Re: [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-05  8:42   ` Vlastimil Babka
  2019-08-05 10:57     ` Vlastimil Babka
@ 2019-08-05 16:54     ` Mike Kravetz
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Kravetz @ 2019-08-05 16:54 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/5/19 1:42 AM, Vlastimil Babka wrote:
> On 8/3/19 12:39 AM, Mike Kravetz wrote:
>> From: Hillf Danton <hdanton@sina.com>
>>
>> Address the issue of should_continue_reclaim continuing true too often
>> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
>> This could happen during hugetlb page allocation causing stalls for
>> minutes or hours.
>>
>> We can stop reclaiming pages if compaction reports it can make a progress.
>> A code reshuffle is needed to do that.
> 
>> And it has side-effects, however,
>> with allocation latencies in other cases but that would come at the cost
>> of potential premature reclaim which has consequences of itself.
> 
> Based on Mel's longer explanation, can we clarify the wording here? e.g.:
> 
> There might be side-effect for other high-order allocations that would
> potentially benefit from more reclaim before compaction for them to be
> faster and less likely to stall, but the consequences of
> premature/over-reclaim are considered worse.
> 
>> We can also bail out of reclaiming pages if we know that there are not
>> enough inactive lru pages left to satisfy the costly allocation.
>>
>> We can give up reclaiming pages too if we see dryrun occur, with the
>> certainty of plenty of inactive pages. IOW with dryrun detected, we are
>> sure we have reclaimed as many pages as we could.
>>
>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: Mel Gorman <mgorman@suse.de>
>> Cc: Michal Hocko <mhocko@kernel.org>
>> Cc: Vlastimil Babka <vbabka@suse.cz>
>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>> Signed-off-by: Hillf Danton <hdanton@sina.com>
>> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Acked-by: Mel Gorman <mgorman@suse.de>
> 
> Acked-by: Vlastimil Babka <vbabka@suse.cz>
> I will send some followup cleanup.
> 
> There should be also Mike's SOB?

Will do.
My apologies, the process of handling patches created by others is new
to me.

Also, will incorporate Mel's explanation.
-- 
Mike Kravetz


* Re: [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-05 10:57     ` Vlastimil Babka
@ 2019-08-05 16:58       ` Mike Kravetz
  2019-08-05 18:34         ` Vlastimil Babka
  0 siblings, 1 reply; 12+ messages in thread
From: Mike Kravetz @ 2019-08-05 16:58 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/5/19 3:57 AM, Vlastimil Babka wrote:
> On 8/5/19 10:42 AM, Vlastimil Babka wrote:
>> On 8/3/19 12:39 AM, Mike Kravetz wrote:
>>> From: Hillf Danton <hdanton@sina.com>
>>>
>>> Address the issue of should_continue_reclaim continuing true too often
>>> for __GFP_RETRY_MAYFAIL attempts when !nr_reclaimed and nr_scanned.
>>> This could happen during hugetlb page allocation causing stalls for
>>> minutes or hours.
>>>
>>> We can stop reclaiming pages if compaction reports it can make a progress.
>>> A code reshuffle is needed to do that.
>>
>>> And it has side-effects, however,
>>> with allocation latencies in other cases but that would come at the cost
>>> of potential premature reclaim which has consequences of itself.
>>
>> Based on Mel's longer explanation, can we clarify the wording here? e.g.:
>>
>> There might be side-effect for other high-order allocations that would
>> potentially benefit from more reclaim before compaction for them to be
>> faster and less likely to stall, but the consequences of
>> premature/over-reclaim are considered worse.
>>
>>> We can also bail out of reclaiming pages if we know that there are not
>>> enough inactive lru pages left to satisfy the costly allocation.
>>>
>>> We can give up reclaiming pages too if we see dryrun occur, with the
>>> certainty of plenty of inactive pages. IOW with dryrun detected, we are
>>> sure we have reclaimed as many pages as we could.
>>>
>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>> Cc: Mel Gorman <mgorman@suse.de>
>>> Cc: Michal Hocko <mhocko@kernel.org>
>>> Cc: Vlastimil Babka <vbabka@suse.cz>
>>> Cc: Johannes Weiner <hannes@cmpxchg.org>
>>> Signed-off-by: Hillf Danton <hdanton@sina.com>
>>> Tested-by: Mike Kravetz <mike.kravetz@oracle.com>
>>> Acked-by: Mel Gorman <mgorman@suse.de>
>>
>> Acked-by: Vlastimil Babka <vbabka@suse.cz>
>> I will send some followup cleanup.
> 
> How about this?
> ----8<----
> From 0040b32462587171ad22395a56699cc036ad483f Mon Sep 17 00:00:00 2001
> From: Vlastimil Babka <vbabka@suse.cz>
> Date: Mon, 5 Aug 2019 12:49:40 +0200
> Subject: [PATCH] mm, reclaim: cleanup should_continue_reclaim()
> 
> After commit "mm, reclaim: make should_continue_reclaim perform dryrun
> detection", a closer look at the function shows that nr_reclaimed == 0 means
> the function will always return false. And since non-zero nr_reclaimed implies
> non-zero nr_scanned, testing nr_scanned serves no purpose, and neither does
> testing for __GFP_RETRY_MAYFAIL.
> 
> This patch thus cleans up the function to test only !nr_reclaimed upfront, and
> removes the __GFP_RETRY_MAYFAIL test and the nr_scanned parameter completely.
> The comment is also updated, explaining that approximating "full LRU list has
> been scanned" with nr_scanned == 0 didn't really work.
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>

Would you like me to add this to the series, or do you want to send later?
-- 
Mike Kravetz


* Re: [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail
  2019-08-05  9:28   ` Vlastimil Babka
@ 2019-08-05 17:12     ` Mike Kravetz
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Kravetz @ 2019-08-05 17:12 UTC (permalink / raw)
  To: Vlastimil Babka, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/5/19 2:28 AM, Vlastimil Babka wrote:
> On 8/3/19 12:39 AM, Mike Kravetz wrote:
>> When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
>> the pages will be interleaved between all nodes of the system.  If
>> nodes are not equal, it is quite possible for one node to fill up
>> before the others.  When this happens, the code still attempts to
>> allocate pages from the full node.  This results in calls to direct
>> reclaim and compaction which slow things down considerably.
>>
>> When allocating pool pages, note the state of the previous allocation
>> for each node.  If previous allocation failed, do not use the
>> aggressive retry algorithm on successive attempts.  The allocation
>> will still succeed if there is memory available, but it will not try
>> as hard to free up memory.
>>
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Looks like only part of the (agreed with) suggestions were implemented?

My bad, I pulled in the wrong patch.

> - set_max_huge_pages() returns -ENOMEM if nodemask can't be allocated,
> but hugetlb_hstate_alloc_pages() doesn't.

That is somewhat intentional.  The calling contexts of the two routines are
significantly different.  hugetlb_hstate_alloc_pages is called at boot time
to handle command line parameters, and it cannot return an error because it
is of type void.

We 'could' print out a warning here.  But, if we can't allocate a node mask
I am pretty sure we will not be able to boot.  I will add a comment.

> - there's still __GFP_NORETRY in nodemask allocations
> - (cosmetics) Mel pointed out that NODEMASK_FREE() works fine with NULL
> pointers

-- 
Mike Kravetz
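
The per-node retry policy described in the quoted changelog above can be illustrated with a toy userspace model. This is a sketch under stated assumptions, not the hugetlb code: struct pool_model, MODEL_NR_NODES and all function names are hypothetical, each "node" is just a counter of free huge pages, and the "aggressive" and "light" attempts differ only in which counter they bump.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NR_NODES 2	/* hypothetical two-node system */

struct pool_model {
	unsigned long free_pages[MODEL_NR_NODES]; /* pages each node can still give */
	unsigned long noretry_mask;   /* bit set: last aggressive attempt failed */
	unsigned long hard_attempts;  /* attempts made with aggressive retry */
	unsigned long light_attempts; /* attempts made without aggressive retry */
};

/* Try one page on one node, remembering the outcome for next time. */
bool model_alloc_on_node(struct pool_model *p, int nid)
{
	bool try_hard = !(p->noretry_mask & (1UL << nid));
	bool ok = p->free_pages[nid] > 0;

	if (try_hard)
		p->hard_attempts++;
	else
		p->light_attempts++;

	if (ok) {
		p->free_pages[nid]--;
		p->noretry_mask &= ~(1UL << nid); /* node works again: try hard */
	} else if (try_hard) {
		p->noretry_mask |= 1UL << nid;    /* back off on this node */
	}
	return ok;
}

/* Interleave allocations across nodes until 'want' pages or all nodes fail. */
unsigned long model_fill_pool(struct pool_model *p, unsigned long want)
{
	unsigned long got = 0;
	int failures_in_a_row = 0;
	int nid = 0;

	while (got < want && failures_in_a_row < MODEL_NR_NODES) {
		if (model_alloc_on_node(p, nid)) {
			got++;
			failures_in_a_row = 0;
		} else {
			failures_in_a_row++;
		}
		nid = (nid + 1) % MODEL_NR_NODES;
	}
	return got;
}

/* Node 0 fills up first; later attempts on it must use the light path. */
void model_self_check(void)
{
	struct pool_model p = { .free_pages = { 1, 4 } };

	assert(model_fill_pool(&p, 5) == 5);
	assert(p.noretry_mask == 1UL);
	assert(p.light_attempts == 2);
}
```

In this model node 0 runs out after one page, so only its first failed attempt pays the aggressive cost and the remaining attempts on it take the light path, while node 1 keeps being tried hard. The allocation still succeeds whenever memory is available, matching the intent stated in the changelog.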


* Re: [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection
  2019-08-05 16:58       ` Mike Kravetz
@ 2019-08-05 18:34         ` Vlastimil Babka
  0 siblings, 0 replies; 12+ messages in thread
From: Vlastimil Babka @ 2019-08-05 18:34 UTC (permalink / raw)
  To: Mike Kravetz, linux-mm, linux-kernel
  Cc: Hillf Danton, Michal Hocko, Mel Gorman, Johannes Weiner,
	Andrea Arcangeli, David Rientjes, Andrew Morton

On 8/5/19 6:58 PM, Mike Kravetz wrote:
>> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> 
> Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Would you like me to add this to the series, or do you want to send later?

Please add, thanks!




Thread overview: 12+ messages
2019-08-02 22:39 [PATCH 0/3] address hugetlb page allocation stalls Mike Kravetz
2019-08-02 22:39 ` [PATCH 1/3] mm, reclaim: make should_continue_reclaim perform dryrun detection Mike Kravetz
2019-08-05  8:42   ` Vlastimil Babka
2019-08-05 10:57     ` Vlastimil Babka
2019-08-05 16:58       ` Mike Kravetz
2019-08-05 18:34         ` Vlastimil Babka
2019-08-05 16:54     ` Mike Kravetz
2019-08-02 22:39 ` [PATCH 2/3] mm, compaction: raise compaction priority after it withdrawns Mike Kravetz
2019-08-05  9:14   ` Vlastimil Babka
2019-08-02 22:39 ` [PATCH 3/3] hugetlbfs: don't retry when pool page allocations start to fail Mike Kravetz
2019-08-05  9:28   ` Vlastimil Babka
2019-08-05 17:12     ` Mike Kravetz
