* [Question] Should direct reclaim time be bounded?
From: Mike Kravetz @ 2019-04-23  4:07 UTC
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Johannes Weiner

I was looking into an issue on our distro kernel where allocation of huge
pages via "echo X > /proc/sys/vm/nr_hugepages" was taking a LONG time.
In this particular case, we were actually allocating huge pages VERY slowly
at the rate of about one every 30 seconds.  I don't want to talk about the
code in our distro kernel, but the situation that caused this issue exists
upstream and appears to be worse there.

One thing to note is that hugetlb page allocation can really stress the
page allocator.  The routine alloc_pool_huge_page is of special concern.

/*
 * Allocates a fresh page to the hugetlb allocator pool in a
 * node-interleaved manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	struct page *page;
	int nr_nodes, node;
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

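	/* Round-robin over allowed nodes, starting at h->next_nid_to_alloc */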
	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

	if (!page)
		return 0;

	put_page(page); /* free it into the hugepage allocator */

	return 1;
}

This routine is called for each huge page the user wants to allocate.  If
they do "echo 4096 > nr_hugepages", this is called 4096 times.
alloc_fresh_huge_page() will eventually call __alloc_pages_nodemask with
__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN in addition to __GFP_THISNODE.
That for_each_node_mask_to_alloc() macro is hugetlb specific and attempts
to allocate huge pages in a round-robin fashion.  When asked to allocate a
huge page, it first tries 'next_nid_to_alloc'.  If that fails, it moves on
to the next allowed node.  (A sketch of the macro follows the quoted docs
below.)  This behavior is 'documented' in kernel docs as:

"On a NUMA platform, the kernel will attempt to distribute the huge page pool
 over all the set of allowed nodes specified by the NUMA memory policy of the
 task that modifies nr_hugepages.  The default for the allowed nodes--when the
 task has default memory policy--is all on-line nodes with memory.  Allowed
 nodes with insufficient available, contiguous memory for a huge page will be
 silently skipped when allocating persistent huge pages.  See the discussion
 below of the interaction of task memory policy, cpusets and per node attributes
 with the allocation and freeing of persistent huge pages.

 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is present in system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any."

However, consider the case of a 2 node system where:
node 0 has 2GB memory
node 1 has 4GB memory

Now, if one wants to allocate 4GB of huge pages, they may be tempted to
simply do "echo 2048 > nr_hugepages".  This will go well until node 0 runs
out of memory.  When that happens, alloc_pool_huge_page() will continue to
be called.  Because of the for_each_node_mask_to_alloc() macro, it will
likely attempt to allocate a page from node 0 first.  It will call direct
reclaim and compaction until it fails.  Only then will it successfully
allocate from node 1.

In our distro kernel, I am thinking about making allocations try "less hard"
on nodes where we start to see failures.  less hard == __GFP_NORETRY, or
skipping direct reclaim entirely.  (A rough sketch of this idea appears at
the end of this mail.)  I was going to try something like this on an
upstream kernel when I noticed that direct reclaim may effectively never
exit.  It 'may' exit eventually, but I instrumented __alloc_pages_slowpath()
and saw it take well over an hour before I 'tricked' it into exiting.

[ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
[ 5916.249271]                   reclaim 5295741  compact 1

This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
with 8GB total memory.  (Assuming HZ=1000, those 5295742 jiffies are about
88 minutes, nearly all of it spent in direct reclaim.)

I have not started looking at the direct reclaim code to see exactly where
we may be stuck, or trying really hard.  My question is, "Is this expected
or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
getting 'stuck' in direct reclaim, the documented behavior for huge page
allocation is not going to happen.
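
For what it's worth, here is a minimal sketch of the "try less hard" idea
mentioned above.  This is hypothetical and untested; the names
(alloc_failed_nodes, pool_alloc_gfp) are made up for illustration:

/*
 * Hypothetical sketch only: remember nodes where a fresh huge page
 * allocation recently failed, and use __GFP_NORETRY there so direct
 * reclaim/compaction give up quickly instead of looping for ages.
 */
static nodemask_t alloc_failed_nodes;

static gfp_t pool_alloc_gfp(struct hstate *h, int nid)
{
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	if (node_isset(nid, alloc_failed_nodes))
		gfp_mask |= __GFP_NORETRY;	/* try "less hard" */

	return gfp_mask;
}

The bit for nid would be set when an allocation fails and cleared when one
succeeds, so a node gets the full reclaim effort again once it recovers.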
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
From: Hillf Danton @ 2019-07-11 15:44 UTC
  To: Mike Kravetz
  Cc: Vlastimil Babka, Michal Hocko, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner


On Thu, 11 Jul 2019 02:42:56 +0800 Mike Kravetz wrote:
> On 7/7/19 10:19 PM, Hillf Danton wrote:
> > On Mon, 01 Jul 2019 20:15:51 -0700 Mike Kravetz wrote:
> >> On 7/1/19 1:59 AM, Mel Gorman wrote:
> >>>
> >>> I think it would be reasonable to have should_continue_reclaim allow an
> >>> exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
> >>> less than SWAP_CLUSTER_MAX and no pages are being reclaimed.
> >>
> >> Thanks Mel,
> >>
> >> I added such a check to should_continue_reclaim.  However, it does not
> >> address the issue I am seeing.  In that do-while loop in shrink_node,
> >> the scan priority is not raised (priority--).  We can enter the loop
> >> with priority == DEF_PRIORITY and continue to loop for minutes as seen
> >> in my previous debug output.
> >>
> > Does it help to raise the priority in your case?
> 
> Thanks Hillf, and sorry for the delay in responding; I have been AFK.
> 
> I am not sure if you wanted to try this somehow in addition to Mel's
> suggestion, or alone.
> 
I wanted you to take a look at it in case you continued to have difficulty
raising the scanning priority.
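
For reference, the check Mel suggested would, as I read it, look roughly
like this inside should_continue_reclaim() (a sketch only, not necessarily
the exact code you tested):

	/*
	 * Bail out once the scan priority has been raised a couple of
	 * levels (a lower sc->priority value means more intense
	 * scanning) while a round scans little and reclaims nothing.
	 */
	if (sc->priority < DEF_PRIORITY - 2 &&
	    nr_scanned < SWAP_CLUSTER_MAX &&
	    !nr_reclaimed)
		return false;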

> Unfortunately, such a change actually causes worse behavior.
> 
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -2543,11 +2543,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> >  	unsigned long pages_for_compaction;
> >  	unsigned long inactive_lru_pages;
> >  	int z;
> > +	bool costly_fg_reclaim = false;
> > 
> >  	/* If not in reclaim/compaction mode, stop */
> >  	if (!in_reclaim_compaction(sc))
> >  		return false;
> > 
> > +	/* Let compact determine what to do for high order allocators */
> > +	costly_fg_reclaim = sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> > +				!current_is_kswapd();
> > +	if (costly_fg_reclaim)
> > +		goto check_compact;
> 
> This goto makes us skip the 'if (!nr_reclaimed && !nr_scanned)' test.
> 
Correct.

> > +
> >  	/* Consider stopping depending on scan and reclaim activity */
> >  	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
> >  		/*
> > @@ -2571,6 +2578,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> >  			return false;
> >  	}
> > 
> > +check_compact:
> >  	/*
> >  	 * If we have not reclaimed enough pages for compaction and the
> >  	 * inactive lists are large enough, continue reclaiming
> 
> It is quite easy to hit the condition where:
> nr_reclaimed == 0  && nr_scanned == 0 is true, but we skip the previous test
> 
Erm, it is becoming harder to imagine what prevented you from raising the
priority, given that it was not difficult to hit the true condition above.

And I see the following in the mail thread,
===
	Date: Fri, 28 Jun 2019 11:20:42 -0700
	Message-ID: <dede2f84-90bf-347a-2a17-fb6b521bf573@oracle.com> (raw)
	In-Reply-To: <04329fea-cd34-4107-d1d4-b2098ebab0ec@suse.cz>

	I got back to looking into the direct reclaim/compaction stalls when
	trying to allocate huge pages.  As previously mentioned, the code is
	looping for a long time in shrink_node().  The routine
	should_continue_reclaim() returns true perhaps more often than it should.

	As Vlastimil guessed, my debug output below shows nr_scanned remaining
	non-zero for quite a while.  This was on v5.2-rc6.
===
There, nr_scanned != 0 and should_continue_reclaim() keeps returning true,
which does not match the condition you say is easy to hit.


> and the compaction check:
> sc->nr_reclaimed < pages_for_compaction &&
> 	inactive_lru_pages > pages_for_compaction
> 
> is true, so we return true before the below check of costly_fg_reclaim
> 
It is the price high-order allocations pay: reclaiming enough pages for
compaction to do its work.  Getting no pages reclaimed and none scanned
while plenty of inactive pages remain is really hard to imagine, and
costly_fg_reclaim is not a good fix for that, imho.
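
For reference, pages_for_compaction comes from compact_gap(), which in
mm/internal.h of that era is essentially:

/*
 * Paraphrased: the free-page headroom compaction wants, i.e. twice
 * the allocation size, covering both migration sources and targets.
 */
static inline unsigned long compact_gap(unsigned int order)
{
	return 2UL << order;
}

So for a 2MB huge page (order 9) that is 1024 base pages: reclaim keeps
going until roughly 4MB worth of pages has been reclaimed, as long as the
inactive lists stay larger than that.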

> > @@ -2583,6 +2591,9 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
> >  			inactive_lru_pages > pages_for_compaction)
> >  		return true;
> > 
> > +	if (costly_fg_reclaim)
> > +		return false;
> > +
> >  	/* If compaction would go ahead or the allocation would succeed, stop */
> >  	for (z = 0; z <= sc->reclaim_idx; z++) {
> >  		struct zone *zone = &pgdat->node_zones[z];
> > --
> >
> 
> As Michal suggested, I'm going to do some testing to see what impact
> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> will have on the number of pages allocated.
> --
> Mike Kravetz
> 


* Re: [Question] Should direct reclaim time be bounded?
From: Hillf Danton @ 2019-07-12  5:47 UTC
  To: Mike Kravetz
  Cc: Vlastimil Babka, Michal Hocko, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner


On Thu, 11 Jul 2019 02:42:56 +0800 Mike Kravetz wrote:
>
> It is quite easy to hit the condition where:
> nr_reclaimed == 0  && nr_scanned == 0 is true, but we skip the previous test
>
Then skipping the __GFP_RETRY_MAYFAIL check makes no sense in your case.
It is restored in the respin below.

> and the compaction check:
> sc->nr_reclaimed < pages_for_compaction &&
> 	inactive_lru_pages > pages_for_compaction
> is true, so we return true before the below check of costly_fg_reclaim
>
In the respin, this check is placed after the COMPACT_SUCCESS loop, and
that loop takes over the role of the 'sc->nr_reclaimed <
pages_for_compaction' test.

Dry-run detection is also added, based on the result of the last round of
shrinking inactive pages: even when plenty of inactive pages remain,
reclaim now stops if that round scanned or reclaimed nothing.


--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2571,18 +2571,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			return false;
 	}

-	/*
-	 * If we have not reclaimed enough pages for compaction and the
-	 * inactive lists are large enough, continue reclaiming
-	 */
-	pages_for_compaction = compact_gap(sc->order);
-	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
-	if (get_nr_swap_pages() > 0)
-		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
-	if (sc->nr_reclaimed < pages_for_compaction &&
-			inactive_lru_pages > pages_for_compaction)
-		return true;
-
 	/* If compaction would go ahead or the allocation would succeed, stop */
 	for (z = 0; z <= sc->reclaim_idx; z++) {
 		struct zone *zone = &pgdat->node_zones[z];
@@ -2598,7 +2586,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
 			;
 		}
 	}
-	return true;
+
+	/*
+	 * If we have not reclaimed enough pages for compaction and the
+	 * inactive lists are large enough, continue reclaiming
+	 */
+	pages_for_compaction = compact_gap(sc->order);
+	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
+	if (get_nr_swap_pages() > 0)
+		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
+
+	return inactive_lru_pages > pages_for_compaction &&
+		/*
+		 * avoid dryrun with plenty of inactive pages
+		 */
+		nr_scanned && nr_reclaimed;
 }

 static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
--


