linux-kernel.vger.kernel.org archive mirror
* [Question] Should direct reclaim time be bounded?
@ 2019-04-23  4:07 Mike Kravetz
  2019-04-23  7:19 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-04-23  4:07 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Michal Hocko, Andrea Arcangeli, Mel Gorman, Vlastimil Babka,
	Johannes Weiner

I was looking into an issue on our distro kernel where allocation of huge
pages via "echo X > /proc/sys/vm/nr_hugepages" was taking a LONG time.
In this particular case, we were actually allocating huge pages VERY slowly
at the rate of about one every 30 seconds.  I don't want to talk about the
code in our distro kernel, but the situation that caused this issue exists
upstream and appears to be worse there.

One thing to note is that hugetlb page allocation can really stress the
page allocator.  The routine alloc_pool_huge_page is of special concern.

/*
 * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
 * manner.
 */
static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
	struct page *page;
	int nr_nodes, node;
	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
		if (page)
			break;
	}

	if (!page)
		return 0;

	put_page(page); /* free it into the hugepage allocator */

	return 1;
}

This routine is called for each huge page the user wants to allocate.  If
they do "echo 4096 > nr_hugepages", this is called 4096 times.
alloc_fresh_huge_page() will eventually call __alloc_pages_nodemask with
__GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN in addition to __GFP_THISNODE.
That for_each_node_mask_to_alloc() macro is hugetlbfs-specific and attempts
to allocate huge pages in a round-robin fashion.  When asked to allocate a
huge page, it first tries 'next_nid_to_alloc'.  If that fails, it moves on
to the next allowed node.  This is 'documented' in kernel docs as:

"On a NUMA platform, the kernel will attempt to distribute the huge page pool
 over all the set of allowed nodes specified by the NUMA memory policy of the
 task that modifies nr_hugepages.  The default for the allowed nodes--when the
 task has default memory policy--is all on-line nodes with memory.  Allowed
 nodes with insufficient available, contiguous memory for a huge page will be
 silently skipped when allocating persistent huge pages.  See the discussion
 below of the interaction of task memory policy, cpusets and per node attributes
 with the allocation and freeing of persistent huge pages.

 The success or failure of huge page allocation depends on the amount of
 physically contiguous memory that is present in system at the time of the
 allocation attempt.  If the kernel is unable to allocate huge pages from
 some nodes in a NUMA system, it will attempt to make up the difference by
 allocating extra pages on other nodes with sufficient available contiguous
 memory, if any."

However, consider the case of a 2 node system where:
node 0 has 2GB memory
node 1 has 4GB memory

Now, if one wants to allocate 4GB of huge pages they may be tempted to simply,
"echo 2048 > nr_hugepages".  At first this will go well until node 0 is out
of memory.  When this happens, alloc_pool_huge_page() will continue to be
called.  Because of that for_each_node_mask_to_alloc() macro, it will likely
attempt to first allocate a page from node 0.  It will call direct reclaim and
compaction until it fails.  Then, it will successfully allocate from node 1.

In our distro kernel, I am thinking about making allocations try "less hard"
on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
I was going to try something like this on an upstream kernel when I noticed
that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
instrumented __alloc_pages_slowpath() and saw it take well over an hour
before I 'tricked' it into exiting.

[ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
[ 5916.249271]                   reclaim 5295741  compact 1

This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
with 8GB total memory.

I have not started looking at the direct reclaim code to see exactly where
we may be stuck, or trying really hard.  My question is, "Is this expected
or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
getting 'stuck' in direct reclaim, the documented behavior for huge page
allocation is not going to happen.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Question] Should direct reclaim time be bounded?
  2019-04-23  4:07 [Question] Should direct reclaim time be bounded? Mike Kravetz
@ 2019-04-23  7:19 ` Michal Hocko
  2019-04-23 16:39   ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-04-23  7:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Vlastimil Babka, Johannes Weiner

On Mon 22-04-19 21:07:28, Mike Kravetz wrote:
[...]
> However, consider the case of a 2 node system where:
> node 0 has 2GB memory
> node 1 has 4GB memory
> 
> Now, if one wants to allocate 4GB of huge pages they may be tempted to simply,
> "echo 2048 > nr_hugepages".  At first this will go well until node 0 is out
> of memory.  When this happens, alloc_pool_huge_page() will continue to be
> called.  Because of that for_each_node_mask_to_alloc() macro, it will likely
> attempt to first allocate a page from node 0.  It will call direct reclaim and
> compaction until it fails.  Then, it will successfully allocate from node 1.

Yeah, the even distribution is quite a strong statement. We just try to
distribute somehow and it is likely not to work really well on systems
with nodes that differ in size. I know it sucks but I've been
recommending /sys/devices/system/node/node$N/hugepages/hugepages-2048kB/nr_hugepages
because that allows one to define the actual policy much better. I guess we
want to be more specific about this in the documentation at least.

> In our distro kernel, I am thinking about making allocations try "less hard"
> on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
> I was going to try something like this on an upstream kernel when I noticed
> that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
> instrumented __alloc_pages_slowpath() and saw it take well over an hour
> before I 'tricked' it into exiting.
> 
> [ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
> [ 5916.249271]                   reclaim 5295741  compact 1

This is unexpected though. What does tries mean? Number of reclaim
attempts? If yes could you enable tracing to see what takes so long in
the reclaim path?

> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
> with 8GB total memory.
> 
> I have not started looking at the direct reclaim code to see exactly where
> we may be stuck, or trying really hard.  My question is, "Is this expected
> or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
> getting 'stuck' in direct reclaim, the documented behavior for huge page
> allocation is not going to happen.

Well, our "how hard to try for hugetlb pages" is quite arbitrary. We
used to retry as long as at least an order worth of pages had been
reclaimed, but that stopped making sense once lumpy reclaim was
gone. So the semantic has changed to reclaim&compact as long as there is
some progress. From what I understand above it seems that you are not
thrashing and calling reclaim again and again but rather one reclaim
round takes ages.

That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
looks like there is something wrong in the reclaim going on.

-- 
Michal Hocko
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
  2019-04-23  7:19 ` Michal Hocko
@ 2019-04-23 16:39   ` Mike Kravetz
  2019-04-24 14:35     ` Vlastimil Babka
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-04-23 16:39 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman,
	Vlastimil Babka, Johannes Weiner

On 4/23/19 12:19 AM, Michal Hocko wrote:
> On Mon 22-04-19 21:07:28, Mike Kravetz wrote:
>> In our distro kernel, I am thinking about making allocations try "less hard"
>> on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
>> I was going to try something like this on an upstream kernel when I noticed
>> that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
>> instrumented __alloc_pages_slowpath() and saw it take well over an hour
>> before I 'tricked' it into exiting.
>>
>> [ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
>> [ 5916.249271]                   reclaim 5295741  compact 1
> 
> This is unexpected though. What does tries mean? Number of reclaim
> attempts? If yes could you enable tracing to see what takes so long in
> the reclaim path?

tries is the number of times we pass the 'retry:' label in
__alloc_pages_slowpath.  In this specific case, I am pretty sure all that
time is in one call to __alloc_pages_direct_reclaim.  My 'trick' to make this
succeed was to "echo 0 > nr_hugepages" in another shell.

>> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
>> with 8GB total memory.
>>
>> I have not started looking at the direct reclaim code to see exactly where
>> we may be stuck, or trying really hard.  My question is, "Is this expected
>> or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
>> getting 'stuck' in direct reclaim, the documented behavior for huge page
>> allocation is not going to happen.
> 
> Well, our "how hard to try for hugetlb pages" is quite arbitrary. We
> used to retry as long as at least an order worth of pages had been
> reclaimed, but that stopped making sense once lumpy reclaim was
> gone.

Yes, that is what I am seeing in our older distro kernel and I can at least
deal with that.

>       So the semantic has changed to reclaim&compact as long as there is
> some progress. From what I understand above it seems that you are not
> thrashing and calling reclaim again and again but rather one reclaim
> round takes ages.

Correct

> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
> looks like there is something wrong in the reclaim going on.

Ok, I will start digging into that.  Just wanted to make sure before I got
into it too deep.

BTW - This is very easy to reproduce.  Just try to allocate more huge pages
than will fit into memory.  I see this 'reclaim taking forever' behavior on
v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-04-23 16:39   ` Mike Kravetz
@ 2019-04-24 14:35     ` Vlastimil Babka
  2019-06-28 18:20       ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Vlastimil Babka @ 2019-04-24 14:35 UTC (permalink / raw)
  To: Mike Kravetz, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Johannes Weiner

On 4/23/19 6:39 PM, Mike Kravetz wrote:
>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
>> looks like there is something wrong in the reclaim going on.
> 
> Ok, I will start digging into that.  Just wanted to make sure before I got
> into it too deep.
> 
> BTW - This is very easy to reproduce.  Just try to allocate more huge pages
> than will fit into memory.  I see this 'reclaim taking forever' behavior on
> v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.

I'd suspect this in should_continue_reclaim():

        /* Consider stopping depending on scan and reclaim activity */
        if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
                /*
                 * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
                 * full LRU list has been scanned and we are still failing
                 * to reclaim pages. This full LRU scan is potentially
                 * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
                 */
                if (!nr_reclaimed && !nr_scanned)
                        return false;

And that for some reason, nr_scanned never becomes zero. But it's hard
to figure out through all the layers of functions :/


* Re: [Question] Should direct reclaim time be bounded?
  2019-04-24 14:35     ` Vlastimil Babka
@ 2019-06-28 18:20       ` Mike Kravetz
  2019-07-01  8:59         ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-06-28 18:20 UTC (permalink / raw)
  To: Vlastimil Babka, Michal Hocko
  Cc: linux-mm, linux-kernel, Andrea Arcangeli, Mel Gorman, Johannes Weiner

On 4/24/19 7:35 AM, Vlastimil Babka wrote:
> On 4/23/19 6:39 PM, Mike Kravetz wrote:
>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
>>> looks like there is something wrong in the reclaim going on.
>>
>> Ok, I will start digging into that.  Just wanted to make sure before I got
>> into it too deep.
>>
>> BTW - This is very easy to reproduce.  Just try to allocate more huge pages
>> than will fit into memory.  I see this 'reclaim taking forever' behavior on
>> v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
> 
> I'd suspect this in should_continue_reclaim():
> 
>         /* Consider stopping depending on scan and reclaim activity */
>         if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
>                 /*
>                  * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
>                  * full LRU list has been scanned and we are still failing
>                  * to reclaim pages. This full LRU scan is potentially
>                  * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
>                  */
>                 if (!nr_reclaimed && !nr_scanned)
>                         return false;
> 
> And that for some reason, nr_scanned never becomes zero. But it's hard
> to figure out through all the layers of functions :/

I got back to looking into the direct reclaim/compaction stalls when
trying to allocate huge pages.  As previously mentioned, the code is
looping for a long time in shrink_node().  The routine
should_continue_reclaim() returns true perhaps more often than it should.

As Vlastimil guessed, my debug code output below shows nr_scanned is remaining
non-zero for quite a while.  This was on v5.2-rc6.

To help determine what is happening, I added a counter to determine how many
times should_continue_reclaim was returning true within one calling context
from shrink_node.  Every 1,000,000 calls, I set a 'debug flag' to get a little
more information about what is happening before the next call.  Here is output
from debug code with some comments about what the debug code is showing.

[ 1477.583859] ***should_continue_reclaim: setting debug_nr_scanned call 1000000
[ 1477.584698] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...

shrink_node calls shrink_node_memcg which calls shrink_list.  shrink_list
can increment sc->nr_scanned which is used to compute the value of nr_scanned
passed to should_continue_reclaim.  Here, we see that only one page is
isolated for potential reclaim.

[ 1477.585465] shrink_page_list: goto activate_locked 4

shrink_page_list determines that the page cannot be reclaimed.  The following
code in shrink_page_list determines this.

                if (page_has_private(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask)) {
                                /* Obviously, my debug code. */
                                if (sc->debug_nr_scanned)
                                        printk("shrink_page_list: goto activate_locked 4\n");
                                goto activate_locked;
                        }

[ 1477.586044] shrink_inactive_list: nr_reclaimed = 0

The bottom line is that page is not reclaimed.  But, it was 'scanned' so ..

[ 1477.586627] ***should_continue_reclaim: sc->nr_reclaimed 9 pages_for_compaction  1024
[ 1477.587546]            nr_reclaimed 0, nr_scanned 1

should_continue_reclaim is called with nr_reclaimed 0, nr_scanned 1

[ 1477.588124]            inactive_lru_pages 1 sc->nr_scanned 1000001

Since sc->nr_scanned is 1000001 for 1000001 calls, it looks like we scanned
one page each time.

Additional similar output without comments.

[ 1511.200515] ***should_continue_reclaim: setting debug_nr_scanned call 2000000
[ 1511.201581] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1511.202615] shrink_page_list: goto activate_locked 4
[ 1511.203561] shrink_inactive_list: nr_reclaimed = 0
[ 1511.204422] ***should_continue_reclaim: sc->nr_reclaimed 10 pages_for_compaction  1024
[ 1511.205785]            nr_reclaimed 0, nr_scanned 1
[ 1511.206569]            inactive_lru_pages 1 sc->nr_scanned 2000001
[ 1543.982310] ***should_continue_reclaim: setting debug_nr_scanned call 3000000
[ 1543.983692] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1543.984645] shrink_page_list: goto activate_locked 4
[ 1543.985405] shrink_inactive_list: nr_reclaimed = 0
[ 1543.986386] ***should_continue_reclaim: sc->nr_reclaimed 11 pages_for_compaction  1024
[ 1543.987662]            nr_reclaimed 0, nr_scanned 1
[ 1543.988423]            inactive_lru_pages 1 sc->nr_scanned 3000001
[ 1577.104923] ***should_continue_reclaim: setting debug_nr_scanned call 4000000
[ 1577.105857] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1577.106752] shrink_page_list: goto activate_locked 4
[ 1577.107454] shrink_inactive_list: nr_reclaimed = 0
[ 1577.108163] ***should_continue_reclaim: sc->nr_reclaimed 11 pages_for_compaction  1024
[ 1577.109423]            nr_reclaimed 0, nr_scanned 1
[ 1577.110205]            inactive_lru_pages 1 sc->nr_scanned 4000001
[ 1614.075236] ***should_continue_reclaim: setting debug_nr_scanned call 5000000
[ 1614.076375] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1614.077410] shrink_page_list: goto activate_locked 4
[ 1614.078210] shrink_inactive_list: nr_reclaimed = 0
[ 1614.078913] ***should_continue_reclaim: sc->nr_reclaimed 13 pages_for_compaction  1024
[ 1614.079907]            nr_reclaimed 0, nr_scanned 1
[ 1614.080496]            inactive_lru_pages 1 sc->nr_scanned 5000001
[ 1650.360466] ***should_continue_reclaim: setting debug_nr_scanned call 6000000
[ 1650.361342] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1650.362147] shrink_page_list: goto activate_locked 4
[ 1650.362759] shrink_inactive_list: nr_reclaimed = 0
[ 1650.363685] ***should_continue_reclaim: sc->nr_reclaimed 13 pages_for_compaction  1024
[ 1650.364648]            nr_reclaimed 0, nr_scanned 1
[ 1650.365222]            inactive_lru_pages 0 sc->nr_scanned 6000001
[ 1685.061380] ***should_continue_reclaim: setting debug_nr_scanned call 7000000
[ 1685.062529] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1685.063615] shrink_page_list: goto activate_locked 4
[ 1685.064439] shrink_inactive_list: nr_reclaimed = 0
[ 1685.065244] ***should_continue_reclaim: sc->nr_reclaimed 14 pages_for_compaction  1024
[ 1685.066536]            nr_reclaimed 0, nr_scanned 1
[ 1685.067356]            inactive_lru_pages 1 sc->nr_scanned 7000001
[ 1719.103176] ***should_continue_reclaim: setting debug_nr_scanned call 8000000
[ 1719.104206] shrink_inactive_list: nr_taken 1 =  isolate_lru_pages(1, ...
[ 1719.105117] shrink_page_list: goto activate_locked 4
[ 1719.105781] shrink_inactive_list: nr_reclaimed = 0
[ 1719.106475] ***should_continue_reclaim: sc->nr_reclaimed 15 pages_for_compaction  1024
[ 1719.107499]            nr_reclaimed 0, nr_scanned 1
[ 1719.108129]            inactive_lru_pages 1 sc->nr_scanned 8000001
[ 1743.911025] ***should_continue_reclaim: false after 8711717 calls !nr_reclaimed && !nr_scanned

Note that we do reclaim a 'few' pages along the way.  After 8711717 calls,
should_continue_reclaim finally hits the condition where (!nr_reclaimed &&
!nr_scanned) and returns false.  We were stuck looping in shrink_node() for
about 5 minutes.

Any additional insight, suggestions for additional debug, etc. would be
appreciated.
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-06-28 18:20       ` Mike Kravetz
@ 2019-07-01  8:59         ` Mel Gorman
  2019-07-02  3:15           ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2019-07-01  8:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote:
> On 4/24/19 7:35 AM, Vlastimil Babka wrote:
> > On 4/23/19 6:39 PM, Mike Kravetz wrote:
> >>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
> >>> looks like there is something wrong in the reclaim going on.
> >>
> >> Ok, I will start digging into that.  Just wanted to make sure before I got
> >> into it too deep.
> >>
> >> BTW - This is very easy to reproduce.  Just try to allocate more huge pages
> >> than will fit into memory.  I see this 'reclaim taking forever' behavior on
> >> v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
> > 
> > I'd suspect this in should_continue_reclaim():
> > 
> >         /* Consider stopping depending on scan and reclaim activity */
> >         if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
> >                 /*
> >                  * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
> >                  * full LRU list has been scanned and we are still failing
> >                  * to reclaim pages. This full LRU scan is potentially
> >                  * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
> >                  */
> >                 if (!nr_reclaimed && !nr_scanned)
> >                         return false;
> > 
> > And that for some reason, nr_scanned never becomes zero. But it's hard
> > to figure out through all the layers of functions :/
> 
> I got back to looking into the direct reclaim/compaction stalls when
> trying to allocate huge pages.  As previously mentioned, the code is
> looping for a long time in shrink_node().  The routine
> should_continue_reclaim() returns true perhaps more often than it should.
> 
> As Vlastimil guessed, my debug code output below shows nr_scanned is remaining
> non-zero for quite a while.  This was on v5.2-rc6.
> 

I think it would be reasonable to have should_continue_reclaim allow an
exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
less than SWAP_CLUSTER_MAX and no pages are being reclaimed.

-- 
Mel Gorman
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-01  8:59         ` Mel Gorman
@ 2019-07-02  3:15           ` Mike Kravetz
  2019-07-03  9:43             ` Mel Gorman
  2019-07-10 18:42             ` Mike Kravetz
  0 siblings, 2 replies; 17+ messages in thread
From: Mike Kravetz @ 2019-07-02  3:15 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On 7/1/19 1:59 AM, Mel Gorman wrote:
> On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote:
>> On 4/24/19 7:35 AM, Vlastimil Babka wrote:
>>> On 4/23/19 6:39 PM, Mike Kravetz wrote:
>>>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
>>>>> looks like there is something wrong in the reclaim going on.
>>>>
>>>> Ok, I will start digging into that.  Just wanted to make sure before I got
>>>> into it too deep.
>>>>
>>>> BTW - This is very easy to reproduce.  Just try to allocate more huge pages
>>>> than will fit into memory.  I see this 'reclaim taking forever' behavior on
>>>> v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
>>>
>>> I'd suspect this in should_continue_reclaim():
>>>
>>>         /* Consider stopping depending on scan and reclaim activity */
>>>         if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
>>>                 /*
>>>                  * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
>>>                  * full LRU list has been scanned and we are still failing
>>>                  * to reclaim pages. This full LRU scan is potentially
>>>                  * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
>>>                  */
>>>                 if (!nr_reclaimed && !nr_scanned)
>>>                         return false;
>>>
>>> And that for some reason, nr_scanned never becomes zero. But it's hard
>>> to figure out through all the layers of functions :/
>>
>> I got back to looking into the direct reclaim/compaction stalls when
>> trying to allocate huge pages.  As previously mentioned, the code is
>> looping for a long time in shrink_node().  The routine
>> should_continue_reclaim() returns true perhaps more often than it should.
>>
>> As Vlastimil guessed, my debug code output below shows nr_scanned is remaining
>> non-zero for quite a while.  This was on v5.2-rc6.
>>
> 
> I think it would be reasonable to have should_continue_reclaim allow an
> exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
> less than SWAP_CLUSTER_MAX and no pages are being reclaimed.

Thanks Mel,

I added such a check to should_continue_reclaim.  However, it does not
address the issue I am seeing.  In that do-while loop in shrink_node,
the scan priority is not raised (priority--).  We can enter the loop
with priority == DEF_PRIORITY and continue to loop for minutes as seen
in my previous debug output.

-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-02  3:15           ` Mike Kravetz
@ 2019-07-03  9:43             ` Mel Gorman
  2019-07-03 23:54               ` Mike Kravetz
  2019-07-10 18:42             ` Mike Kravetz
  1 sibling, 1 reply; 17+ messages in thread
From: Mel Gorman @ 2019-07-03  9:43 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mel Gorman, Vlastimil Babka, Michal Hocko, linux-mm,
	linux-kernel, Andrea Arcangeli, Johannes Weiner

On Mon, Jul 01, 2019 at 08:15:50PM -0700, Mike Kravetz wrote:
> On 7/1/19 1:59 AM, Mel Gorman wrote:
> > On Fri, Jun 28, 2019 at 11:20:42AM -0700, Mike Kravetz wrote:
> >> On 4/24/19 7:35 AM, Vlastimil Babka wrote:
> >>> On 4/23/19 6:39 PM, Mike Kravetz wrote:
> >>>>> That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It
> >>>>> looks like there is something wrong in the reclaim going on.
> >>>>
> >>>> Ok, I will start digging into that.  Just wanted to make sure before I got
> >>>> into it too deep.
> >>>>
> >>>> BTW - This is very easy to reproduce.  Just try to allocate more huge pages
> >>>> than will fit into memory.  I see this 'reclaim taking forever' behavior on
> >>>> v5.1-rc5-mmotm-2019-04-19-14-53.  Looks like it was there in v5.0 as well.
> >>>
> >>> I'd suspect this in should_continue_reclaim():
> >>>
> >>>         /* Consider stopping depending on scan and reclaim activity */
> >>>         if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
> >>>                 /*
> >>>                  * For __GFP_RETRY_MAYFAIL allocations, stop reclaiming if the
> >>>                  * full LRU list has been scanned and we are still failing
> >>>                  * to reclaim pages. This full LRU scan is potentially
> >>>                  * expensive but a __GFP_RETRY_MAYFAIL caller really wants to succeed
> >>>                  */
> >>>                 if (!nr_reclaimed && !nr_scanned)
> >>>                         return false;
> >>>
> >>> And that for some reason, nr_scanned never becomes zero. But it's hard
> >>> to figure out through all the layers of functions :/
> >>
> >> I got back to looking into the direct reclaim/compaction stalls when
> >> trying to allocate huge pages.  As previously mentioned, the code is
> >> looping for a long time in shrink_node().  The routine
> >> should_continue_reclaim() returns true perhaps more often than it should.
> >>
> >> As Vlastimil guessed, my debug code output below shows nr_scanned is remaining
> >> non-zero for quite a while.  This was on v5.2-rc6.
> >>
> > 
> > I think it would be reasonable to have should_continue_reclaim allow an
> > exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
> > less than SWAP_CLUSTER_MAX and no pages are being reclaimed.
> 
> Thanks Mel,
> 
> I added such a check to should_continue_reclaim.  However, it does not
> address the issue I am seeing.  In that do-while loop in shrink_node,
> the scan priority is not raised (priority--).  We can enter the loop
> with priority == DEF_PRIORITY and continue to loop for minutes as seen
> in my previous debug output.
> 

Indeed. I'm getting knocked offline shortly so I didn't give this the
time it deserves, but it appears that part of this problem is
hugetlb-specific: when one node is full, reclaim can enter this continual
loop because __GFP_RETRY_MAYFAIL requires both nr_reclaimed and
nr_scanned to be zero.

Have you considered one of the following as an option?

1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common
   and copy node_states if necessary. Add a bool parameter to
   alloc_pool_huge_page that is true when called from set_max_huge_pages.
   If an allocation from alloc_fresh_huge_page fails, clear the failing node
   from the mask so it's not retried, bail if the mask is empty. The
   consequences are that round-robin allocation of huge pages will be
   different if a node failed to allocate for transient reasons.

2. Alter the condition in should_continue_reclaim for
   __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX.
   Either raise priority (will interfere with kswapd though) or
   bail entirely.  Consequences may be that other __GFP_RETRY_MAYFAIL
   allocations do not want this behaviour. There are a lot of users.

3. Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
   Strip the flag if an allocation fails on a node. Consequences are
   that setting the required number of huge pages is more likely to
   return without all the huge pages set.

-- 
Mel Gorman
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-03  9:43             ` Mel Gorman
@ 2019-07-03 23:54               ` Mike Kravetz
  2019-07-04 11:09                 ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-07-03 23:54 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Mel Gorman, Vlastimil Babka, Michal Hocko, linux-mm,
	linux-kernel, Andrea Arcangeli, Johannes Weiner

On 7/3/19 2:43 AM, Mel Gorman wrote:
> Indeed. I'm getting knocked offline shortly so I didn't give this the
> time it deserves, but it appears that part of this problem is
> hugetlb-specific: when one node is full, reclaim can enter this continual
> loop because __GFP_RETRY_MAYFAIL requires both nr_reclaimed and
> nr_scanned to be zero.

Yes, I am not aware of any other large order allocations consistently made
with __GFP_RETRY_MAYFAIL.  But, I did not look too closely.  Michal believes
that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.

> Have you considered one of the following as an option?
> 
> 1. Always use the on-stack nodes_allowed in __nr_hugepages_store_common
>    and copy node_states if necessary. Add a bool parameter to
>    alloc_pool_huge_page that is true when called from set_max_huge_pages.
>    If an allocation from alloc_fresh_huge_page fails, clear the failing node
>    from the mask so it's not retried, bail if the mask is empty. The
>    consequences are that round-robin allocation of huge pages will be
>    different if a node failed to allocate for transient reasons.

That seems to be a more aggressive form of 3 below.

> 2. Alter the condition in should_continue_reclaim for
>    __GFP_RETRY_MAYFAIL to consider if nr_scanned < SWAP_CLUSTER_MAX.
>    Either raise priority (will interfere with kswapd though) or
>    bail entirely.  Consequences may be that other __GFP_RETRY_MAYFAIL
>    allocations do not want this behaviour. There are a lot of users.

Due to the high number of users, I am avoiding such a change.  It would be
hard to validate that such a change does not impact other users.

> 3. Move where __GFP_RETRY_MAYFAIL is set in a gfp_mask in mm/hugetlb.c.
>    Strip the flag if an allocation fails on a node. Consequences are
>    that setting the required number of huge pages is more likely to
>    return without all the huge pages set.

We are actually using a form of this in our distro kernel.  It works quite
well on the older (4.11 based) distro kernel.  My plan was to push this
upstream.  However, when I tested this on recent upstream kernels, I
encountered long stalls associated with the first __GFP_RETRY_MAYFAIL
allocation failure.  That is what prompted me to ask this question/start
this thread.  The distro kernel would see stalls taking tens of seconds,
upstream would see stalls of several minutes.  Much has changed since 4.11,
so I was trying to figure out what might be causing this change in behavior.

BTW, here is the patch I was testing.  It actually has additional code to
switch between __GFP_RETRY_MAYFAIL and __GFP_NORETRY and back to hopefully
take into account transient conditions.

From 528c52397301f02acb614c610bd65f0f9a107481 Mon Sep 17 00:00:00 2001
From: Mike Kravetz <mike.kravetz@oracle.com>
Date: Wed, 3 Jul 2019 13:36:24 -0700
Subject: [PATCH] hugetlbfs: don't retry when pool page allocations start to
 fail

When allocating hugetlbfs pool pages via /proc/sys/vm/nr_hugepages,
the pages will be interleaved between all nodes of the system.  If
nodes are not equal, it is quite possible for one node to fill up
before the others.  When this happens, the code still attempts to
allocate pages from the full node.  This results in calls to direct
reclaim and compaction which slow things down considerably.

When allocating pool pages, note the state of the previous allocation
for each node.  If previous allocation failed, do not use the
aggressive retry algorithm on successive attempts.  The allocation
will still succeed if there is memory available, but it will not try
as hard to free up memory.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 87 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 77 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ede7e7f5d1ab..f3c50344a9b4 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1405,12 +1405,27 @@ pgoff_t __basepage_index(struct page *page)
 }
 
 static struct page *alloc_buddy_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	int order = huge_page_order(h);
 	struct page *page;
+	bool alloc_try_hard = true;
 
-	gfp_mask |= __GFP_COMP|__GFP_RETRY_MAYFAIL|__GFP_NOWARN;
+	/*
+	 * By default we always try hard to allocate the page with
+	 * __GFP_RETRY_MAYFAIL flag.  However, if we are allocating pages in
+	 * a loop (to adjust global huge page counts) and previous allocation
+	 * failed, do not continue to try hard on the same node.  Use the
+	 * node_alloc_noretry bitmap to manage this state information.
+	 */
+	if (node_alloc_noretry && node_isset(nid, *node_alloc_noretry))
+		alloc_try_hard = false;
+	gfp_mask |= __GFP_COMP|__GFP_NOWARN;
+	if (alloc_try_hard)
+		gfp_mask |= __GFP_RETRY_MAYFAIL;
+	else
+		gfp_mask |= __GFP_NORETRY;
 	if (nid == NUMA_NO_NODE)
 		nid = numa_mem_id();
 	page = __alloc_pages_nodemask(gfp_mask, order, nid, nmask);
@@ -1419,6 +1434,22 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
 	else
 		__count_vm_event(HTLB_BUDDY_PGALLOC_FAIL);
 
+	/*
+	 * If we did not specify __GFP_RETRY_MAYFAIL, but still got a page this
+	 * indicates an overall state change.  Clear bit so that we resume
+	 * normal 'try hard' allocations.
+	 */
+	if (node_alloc_noretry && page && !alloc_try_hard)
+		node_clear(nid, *node_alloc_noretry);
+
+	/*
+	 * If we tried hard to get a page but failed, set bit so that
+	 * subsequent attempts will not try as hard until there is an
+	 * overall state change.
+	 */
+	if (node_alloc_noretry && !page && alloc_try_hard)
+		node_set(nid, *node_alloc_noretry);
+
 	return page;
 }
 
@@ -1427,7 +1458,8 @@ static struct page *alloc_buddy_huge_page(struct hstate *h,
  * should use this function to get new hugetlb pages
  */
 static struct page *alloc_fresh_huge_page(struct hstate *h,
-		gfp_t gfp_mask, int nid, nodemask_t *nmask)
+		gfp_t gfp_mask, int nid, nodemask_t *nmask,
+		nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 
@@ -1435,7 +1467,7 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 		page = alloc_gigantic_page(h, gfp_mask, nid, nmask);
 	else
 		page = alloc_buddy_huge_page(h, gfp_mask,
-				nid, nmask);
+				nid, nmask, node_alloc_noretry);
 	if (!page)
 		return NULL;
 
@@ -1450,14 +1482,16 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
  * Allocates a fresh page to the hugetlb allocator pool in the node interleaved
  * manner.
  */
-static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
+static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed,
+				nodemask_t *node_alloc_noretry)
 {
 	struct page *page;
 	int nr_nodes, node;
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 
 	for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
-		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
+		page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed,
+						node_alloc_noretry);
 		if (page)
 			break;
 	}
@@ -1601,7 +1635,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock(&hugetlb_lock);
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -1637,7 +1671,7 @@ struct page *alloc_migrate_huge_page(struct hstate *h, gfp_t gfp_mask,
 	if (hstate_is_gigantic(h))
 		return NULL;
 
-	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask);
+	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
 
@@ -2207,13 +2241,31 @@ static void __init gather_bootmem_prealloc(void)
 static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 {
 	unsigned long i;
+	nodemask_t *node_alloc_noretry;
+
+	if (!hstate_is_gigantic(h)) {
+		/*
+		 * bit mask controlling how hard we retry per-node
+		 * allocations.
+		 */
+		node_alloc_noretry = kmalloc(sizeof(*node_alloc_noretry),
+						GFP_KERNEL | __GFP_NORETRY);
+	} else {
+		/* allocations done at boot time */
+		node_alloc_noretry = NULL;
+	}
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	for (i = 0; i < h->max_huge_pages; ++i) {
 		if (hstate_is_gigantic(h)) {
 			if (!alloc_bootmem_huge_page(h))
 				break;
 		} else if (!alloc_pool_huge_page(h,
-					 &node_states[N_MEMORY]))
+					 &node_states[N_MEMORY],
+					 node_alloc_noretry))
 			break;
 		cond_resched();
 	}
@@ -2225,6 +2277,9 @@ static void __init hugetlb_hstate_alloc_pages(struct hstate *h)
 			h->max_huge_pages, buf, i);
 		h->max_huge_pages = i;
 	}
+
+	if (node_alloc_noretry)
+		kfree(node_alloc_noretry);
 }
 
 static void __init hugetlb_init_hstates(void)
@@ -2323,6 +2378,12 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 			      nodemask_t *nodes_allowed)
 {
 	unsigned long min_count, ret;
+	NODEMASK_ALLOC(nodemask_t, node_alloc_noretry,
+						GFP_KERNEL | __GFP_NORETRY);
+
+	/* bit mask controlling how hard we retry per-node allocations */
+	if (node_alloc_noretry)
+		nodes_clear(*node_alloc_noretry);
 
 	spin_lock(&hugetlb_lock);
 
@@ -2356,6 +2417,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	if (hstate_is_gigantic(h) && !IS_ENABLED(CONFIG_CONTIG_ALLOC)) {
 		if (count > persistent_huge_pages(h)) {
 			spin_unlock(&hugetlb_lock);
+			if (node_alloc_noretry)
+				NODEMASK_FREE(node_alloc_noretry);
 			return -EINVAL;
 		}
 		/* Fall through to decrease pool */
@@ -2388,7 +2451,8 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 		/* yield cpu to avoid soft lockup */
 		cond_resched();
 
-		ret = alloc_pool_huge_page(h, nodes_allowed);
+		ret = alloc_pool_huge_page(h, nodes_allowed,
+						node_alloc_noretry);
 		spin_lock(&hugetlb_lock);
 		if (!ret)
 			goto out;
@@ -2429,6 +2493,9 @@ static int set_max_huge_pages(struct hstate *h, unsigned long count, int nid,
 	h->max_huge_pages = persistent_huge_pages(h);
 	spin_unlock(&hugetlb_lock);
 
+	if (node_alloc_noretry)
+		NODEMASK_FREE(node_alloc_noretry);
+
 	return 0;
 }
 
-- 
2.20.1



* Re: [Question] Should direct reclaim time be bounded?
  2019-07-03 23:54               ` Mike Kravetz
@ 2019-07-04 11:09                 ` Michal Hocko
  2019-07-04 15:11                   ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-07-04 11:09 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mel Gorman, Mel Gorman, Vlastimil Babka, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
> On 7/3/19 2:43 AM, Mel Gorman wrote:
> > Indeed. I'm getting knocked offline shortly so I didn't give this the
> > time it deserves but it appears that part of this problem is
> > hugetlb-specific when one node is full and can enter into this continual
> > loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
> > nr_scanned to be zero.
> 
> Yes, I am not aware of any other large order allocations consistently made
> with __GFP_RETRY_MAYFAIL.  But, I did not look too closely.  Michal believes
> that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.

Yes. The argument is that this is controllable by an admin and failures
should be prevented as much as possible. I didn't get to understand the
should_continue_reclaim part of the problem, but I have a strong feeling
that the __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
happens if it is simply removed and we rely only on the retry mechanism
from the page allocator instead? Is the success rate reduced
considerably?
-- 
Michal Hocko
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-04 11:09                 ` Michal Hocko
@ 2019-07-04 15:11                   ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2019-07-04 15:11 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mel Gorman, Mel Gorman, Vlastimil Babka, linux-mm, linux-kernel,
	Andrea Arcangeli, Johannes Weiner

On 7/4/19 4:09 AM, Michal Hocko wrote:
> On Wed 03-07-19 16:54:35, Mike Kravetz wrote:
>> On 7/3/19 2:43 AM, Mel Gorman wrote:
>>> Indeed. I'm getting knocked offline shortly so I didn't give this the
>>> time it deserves but it appears that part of this problem is
>>> hugetlb-specific when one node is full and can enter into this continual
>>> loop due to __GFP_RETRY_MAYFAIL requiring both nr_reclaimed and
>>> nr_scanned to be zero.
>>
>> Yes, I am not aware of any other large order allocations consistently made
>> with __GFP_RETRY_MAYFAIL.  But, I did not look too closely.  Michal believes
>> that hugetlb pages allocations should use __GFP_RETRY_MAYFAIL.
> 
> Yes. The argument is that this is controllable by an admin and failures
> should be prevented as much as possible. I didn't get to understand the
> should_continue_reclaim part of the problem, but I have a strong feeling
> that the __GFP_RETRY_MAYFAIL handling at that layer is not correct. What
> happens if it is simply removed and we rely only on the retry mechanism
> from the page allocator instead? Is the success rate reduced
> considerably?

It certainly will be reduced.  I 'think' it will be hard to predict how
much it will be reduced as this will depend on the state of memory usage
and fragmentation at the time of the attempt.

I can try to measure this, but it will be a few days due to the U.S. holiday.
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-02  3:15           ` Mike Kravetz
  2019-07-03  9:43             ` Mel Gorman
@ 2019-07-10 18:42             ` Mike Kravetz
  2019-07-10 19:44               ` Michal Hocko
  1 sibling, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-07-10 18:42 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Vlastimil Babka, Michal Hocko, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner

On 7/7/19 10:19 PM, Hillf Danton wrote:
> On Mon, 01 Jul 2019 20:15:51 -0700 Mike Kravetz wrote:
>> On 7/1/19 1:59 AM, Mel Gorman wrote:
>>>
>>> I think it would be reasonable to have should_continue_reclaim allow an
>>> exit if scanning at higher priority than DEF_PRIORITY - 2, nr_scanned is
>>> less than SWAP_CLUSTER_MAX and no pages are being reclaimed.
>>
>> Thanks Mel,
>>
>> I added such a check to should_continue_reclaim.  However, it does not
>> address the issue I am seeing.  In that do-while loop in shrink_node,
>> the scan priority is not raised (priority--).  We can enter the loop
>> with priority == DEF_PRIORITY and continue to loop for minutes as seen
>> in my previous debug output.
>>
> Does it help raise the priority in your case?

Thanks Hillf, sorry for the delay in responding; I have been AFK.

I am not sure if you wanted to try this somehow in addition to Mel's
suggestion, or alone.

Unfortunately, such a change actually causes worse behavior.

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2543,11 +2543,18 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  	unsigned long pages_for_compaction;
>  	unsigned long inactive_lru_pages;
>  	int z;
> +	bool costly_fg_reclaim = false;
>  
>  	/* If not in reclaim/compaction mode, stop */
>  	if (!in_reclaim_compaction(sc))
>  		return false;
>  
> +	/* Let compact determine what to do for high order allocators */
> +	costly_fg_reclaim = sc->order > PAGE_ALLOC_COSTLY_ORDER &&
> +				!current_is_kswapd();
> +	if (costly_fg_reclaim)
> +		goto check_compact;

This goto makes us skip the 'if (!nr_reclaimed && !nr_scanned)' test.

> +
>  	/* Consider stopping depending on scan and reclaim activity */
>  	if (sc->gfp_mask & __GFP_RETRY_MAYFAIL) {
>  		/*
> @@ -2571,6 +2578,7 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			return false;
>  	}
>  
> +check_compact:
>  	/*
>  	 * If we have not reclaimed enough pages for compaction and the
>  	 * inactive lists are large enough, continue reclaiming

It is quite easy to hit the condition where:
nr_reclaimed == 0  && nr_scanned == 0 is true, but we skip the previous test

and the compaction check:
sc->nr_reclaimed < pages_for_compaction &&
	inactive_lru_pages > pages_for_compaction

is true, so we return true before the below check of costly_fg_reclaim

> @@ -2583,6 +2591,9 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			inactive_lru_pages > pages_for_compaction)
>  		return true;
>  
> +	if (costly_fg_reclaim)
> +		return false;
> +
>  	/* If compaction would go ahead or the allocation would succeed, stop */
>  	for (z = 0; z <= sc->reclaim_idx; z++) {
>  		struct zone *zone = &pgdat->node_zones[z];
> --
> 

As Michal suggested, I'm going to do some testing to see what impact
dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
will have on the number of pages allocated.
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-10 18:42             ` Mike Kravetz
@ 2019-07-10 19:44               ` Michal Hocko
  2019-07-10 23:36                 ` Mike Kravetz
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-07-10 19:44 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner

On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
[...]
> As Michal suggested, I'm going to do some testing to see what impact
> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> will have on the number of pages allocated.

Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
allocation request. I meant to drop the special casing of the flag in
should_continue_reclaim. I really have a hard time arguing for this
special casing, TBH. The flag is meant to retry harder, but that shouldn't
be reduced to a single reclaim attempt, because that alone doesn't really
help much with the high-order allocation. It is more about retrying
compaction harder.
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [Question] Should direct reclaim time be bounded?
  2019-07-10 19:44               ` Michal Hocko
@ 2019-07-10 23:36                 ` Mike Kravetz
  2019-07-11  7:12                   ` Michal Hocko
  0 siblings, 1 reply; 17+ messages in thread
From: Mike Kravetz @ 2019-07-10 23:36 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner

On 7/10/19 12:44 PM, Michal Hocko wrote:
> On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> [...]
>> As Michal suggested, I'm going to do some testing to see what impact
>> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
>> will have on the number of pages allocated.
> 
> Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> allocation request. I meant to drop the special casing of the flag in
> should_continue_reclaim. I really have hard time to argue for this
> special casing TBH. The flag is meant to retry harder but that shouldn't
> be reduced to a single reclaim attempt because that alone doesn't really
> help much with the high order allocation. It is more about compaction to
> be retried harder.

Thanks Michal.  That is indeed what you suggested earlier.  I remembered
incorrectly.  Sorry.

Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
in the previous pass.
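The difference between the two stop conditions can be sketched as a userspace simulation (this mirrors only the early-exit checks described above, not the kernel function's later compaction logic):

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Simulation of the early-exit checks in should_continue_reclaim()
 * (not kernel code). With the __GFP_RETRY_MAYFAIL special case,
 * reclaim only stops when nothing was reclaimed AND nothing was
 * scanned; without it, one pass that reclaims nothing is enough
 * to stop.
 */
static bool keep_reclaiming_special(unsigned long nr_reclaimed,
				    unsigned long nr_scanned)
{
	return !(nr_reclaimed == 0 && nr_scanned == 0);
}

static bool keep_reclaiming_plain(unsigned long nr_reclaimed,
				  unsigned long nr_scanned)
{
	(void)nr_scanned;
	return nr_reclaimed != 0;
}
```

A full node that keeps scanning (nr_scanned > 0) while reclaiming nothing keeps looping under the first condition, which matches the continual-loop behavior Mel described earlier in the thread.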

When I make such a modification and test, I see long stalls as a result
of should_compact_retry returning true too often.  On a system I am currently
testing, should_compact_retry has returned true 36000000 times.  My guess
is that this may stall forever.  Vlastimil previously asked about this behavior,
so I am capturing the reason.  Like before [1], should_compact_retry is
returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.

Total 36000000
      35437500	COMPACT_DEFERRED
        562500  COMPACT_PARTIAL_SKIPPED


[1] https://lkml.org/lkml/2019/6/5/643
-- 
Mike Kravetz


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-10 23:36                 ` Mike Kravetz
@ 2019-07-11  7:12                   ` Michal Hocko
  2019-07-12  9:49                     ` Mel Gorman
  0 siblings, 1 reply; 17+ messages in thread
From: Michal Hocko @ 2019-07-11  7:12 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Hillf Danton, Vlastimil Babka, Mel Gorman, linux-mm,
	linux-kernel, Johannes Weiner

On Wed 10-07-19 16:36:58, Mike Kravetz wrote:
> On 7/10/19 12:44 PM, Michal Hocko wrote:
> > On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> > [...]
> >> As Michal suggested, I'm going to do some testing to see what impact
> >> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> >> will have on the number of pages allocated.
> > 
> > Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> > allocation request. I meant to drop the special casing of the flag in
> > should_continue_reclaim. I really have a hard time arguing for this
> > special casing, TBH. The flag is meant to retry harder, but that shouldn't
> > be reduced to a single reclaim attempt, because that alone doesn't really
> > help much with the high-order allocation. It is more about retrying
> > compaction harder.
> 
> Thanks Michal.  That is indeed what you suggested earlier.  I remembered
> incorrectly.  Sorry.
> 
> Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
> implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
> in the previous pass.
> 
> When I make such a modification and test, I see long stalls as a result
> of should_compact_retry returning true too often.  On a system I am currently
> testing, should_compact_retry has returned true 36000000 times.  My guess
> is that this may stall forever.  Vlastimil previously asked about this behavior,
> so I am capturing the reason.  Like before [1], should_compact_retry is
> returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.

This smells like a problem to me. But somebody more familiar with
compaction should comment.

> 
> Total 36000000
>       35437500	COMPACT_DEFERRED
>         562500  COMPACT_PARTIAL_SKIPPED
> 
> 
> [1] https://lkml.org/lkml/2019/6/5/643
> -- 
> Mike Kravetz

-- 
Michal Hocko
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
  2019-07-11  7:12                   ` Michal Hocko
@ 2019-07-12  9:49                     ` Mel Gorman
  0 siblings, 0 replies; 17+ messages in thread
From: Mel Gorman @ 2019-07-12  9:49 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Mike Kravetz, Hillf Danton, Vlastimil Babka, linux-mm,
	linux-kernel, Johannes Weiner

On Thu, Jul 11, 2019 at 09:12:45AM +0200, Michal Hocko wrote:
> On Wed 10-07-19 16:36:58, Mike Kravetz wrote:
> > On 7/10/19 12:44 PM, Michal Hocko wrote:
> > > On Wed 10-07-19 11:42:40, Mike Kravetz wrote:
> > > [...]
> > >> As Michal suggested, I'm going to do some testing to see what impact
> > >> dropping the __GFP_RETRY_MAYFAIL flag for these huge page allocations
> > >> will have on the number of pages allocated.
> > > 
> > > Just to clarify. I didn't mean to drop __GFP_RETRY_MAYFAIL from the
> > > allocation request. I meant to drop the special casing of the flag in
> > > should_continue_reclaim. I really have a hard time arguing for this
> > > special casing, TBH. The flag is meant to retry harder, but that shouldn't
> > > be reduced to a single reclaim attempt, because that alone doesn't really
> > > help much with the high-order allocation. It is more about retrying
> > > compaction harder.
> > 
> > Thanks Michal.  That is indeed what you suggested earlier.  I remembered
> > incorrectly.  Sorry.
> > 
> > Removing the special casing for __GFP_RETRY_MAYFAIL in should_continue_reclaim
> > implies that it will return false if nothing was reclaimed (nr_reclaimed == 0)
> > in the previous pass.
> > 
> > When I make such a modification and test, I see long stalls as a result
> > of should_compact_retry returning true too often.  On a system I am currently
> > testing, should_compact_retry has returned true 36000000 times.  My guess
> > is that this may stall forever.  Vlastimil previously asked about this behavior,
> > so I am capturing the reason.  Like before [1], should_compact_retry is
> > returning true mostly because compaction_withdrawn() returns COMPACT_DEFERRED.
> 
> This smells like a problem to me. But somebody more familiar with
> compaction should comment.
> 

Examine in should_compact_retry whether it's retrying because
compaction_zonelist_suitable is true. Looking at it now, it would not
necessarily do the right thing, because any non-skipped zone would make
it eligible, which is too strong a condition as COMPACT_SKIPPED is not
reliably set. If that is the case, it would be reasonable to
remove "ret = compaction_zonelist_suitable(ac, order, alloc_flags);" and
the implementation of compaction_zonelist_suitable entirely as part of
your fix.
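A toy illustration (not the kernel implementation, and zonelist_suitable is a hypothetical name) of why "any eligible zone" is too strong a retry condition: a single zone that always reports itself suitable, but never actually compacts, keeps the retry loop alive indefinitely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Sketch of the retry criterion: the whole zonelist counts as
 * suitable for compaction as soon as ANY one zone does. One
 * perpetually "suitable" zone is enough to keep retrying forever.
 */
static bool zonelist_suitable(const bool *zone_suitable, size_t nr_zones)
{
	for (size_t z = 0; z < nr_zones; z++)
		if (zone_suitable[z])
			return true;	/* one eligible zone triggers a retry */
	return false;
}
```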

-- 
Mel Gorman
SUSE Labs


* Re: [Question] Should direct reclaim time be bounded?
       [not found] <20190712054732.7264-1-hdanton@sina.com>
@ 2019-07-13  1:11 ` Mike Kravetz
  0 siblings, 0 replies; 17+ messages in thread
From: Mike Kravetz @ 2019-07-13  1:11 UTC (permalink / raw)
  To: Hillf Danton, Mel Gorman
  Cc: Vlastimil Babka, Michal Hocko, linux-mm, linux-kernel, Johannes Weiner

On 7/11/19 10:47 PM, Hillf Danton wrote:
> 
> On Thu, 11 Jul 2019 02:42:56 +0800 Mike Kravetz wrote:
>>
>> It is quite easy to hit the condition where:
>> nr_reclaimed == 0  && nr_scanned == 0 is true, but we skip the previous test
>>
> Then skipping check of __GFP_RETRY_MAYFAIL makes no sense in your case.
> It is restored in respin below.
> 
>> and the compaction check:
>> sc->nr_reclaimed < pages_for_compaction &&
>> 	inactive_lru_pages > pages_for_compaction
>> is true, so we return true before the below check of costly_fg_reclaim
>>
> This check is placed after COMPACT_SUCCESS; the latter is used to
> replace sc->nr_reclaimed < pages_for_compaction.
> 
> And dryrun detection is added based on the result of last round of
> shrinking of inactive pages, particularly when their number is large
> enough.
> 

Thanks Hillf.

This change does appear to eliminate the issue with stalls by
should_continue_reclaim returning true too often.  I need to think
some more about exactly what is impacted with the change.

With this change, the problem moves to compaction with should_compact_retry
returning true too often.  It is the same behavior seen when I simply removed
the __GFP_RETRY_MAYFAIL special casing in should_continue_reclaim.

At Mel's suggestion I removed the compaction_zonelist_suitable() call
from should_compact_retry.  This eliminated the compaction stalls.  Thanks
Mel.

With both changes, stalls appear to be eliminated.  This is promising.
I'll try to refine these approaches and continue testing.
-- 
Mike Kravetz

> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -2571,18 +2571,6 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			return false;
>  	}
> 
> -	/*
> -	 * If we have not reclaimed enough pages for compaction and the
> -	 * inactive lists are large enough, continue reclaiming
> -	 */
> -	pages_for_compaction = compact_gap(sc->order);
> -	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> -	if (get_nr_swap_pages() > 0)
> -		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> -	if (sc->nr_reclaimed < pages_for_compaction &&
> -			inactive_lru_pages > pages_for_compaction)
> -		return true;
> -
>  	/* If compaction would go ahead or the allocation would succeed, stop */
>  	for (z = 0; z <= sc->reclaim_idx; z++) {
>  		struct zone *zone = &pgdat->node_zones[z];
> @@ -2598,7 +2586,21 @@ static inline bool should_continue_reclaim(struct pglist_data *pgdat,
>  			;
>  		}
>  	}
> -	return true;
> +
> +	/*
> +	 * If we have not reclaimed enough pages for compaction and the
> +	 * inactive lists are large enough, continue reclaiming
> +	 */
> +	pages_for_compaction = compact_gap(sc->order);
> +	inactive_lru_pages = node_page_state(pgdat, NR_INACTIVE_FILE);
> +	if (get_nr_swap_pages() > 0)
> +		inactive_lru_pages += node_page_state(pgdat, NR_INACTIVE_ANON);
> +
> +	return inactive_lru_pages > pages_for_compaction &&
> +		/*
> +		 * avoid dryrun with plenty of inactive pages
> +		 */
> +		nr_scanned && nr_reclaimed;
>  }
> 
>  static bool pgdat_memcg_congested(pg_data_t *pgdat, struct mem_cgroup *memcg)
> --
> 


end of thread

Thread overview: 17+ messages
2019-04-23  4:07 [Question] Should direct reclaim time be bounded? Mike Kravetz
2019-04-23  7:19 ` Michal Hocko
2019-04-23 16:39   ` Mike Kravetz
2019-04-24 14:35     ` Vlastimil Babka
2019-06-28 18:20       ` Mike Kravetz
2019-07-01  8:59         ` Mel Gorman
2019-07-02  3:15           ` Mike Kravetz
2019-07-03  9:43             ` Mel Gorman
2019-07-03 23:54               ` Mike Kravetz
2019-07-04 11:09                 ` Michal Hocko
2019-07-04 15:11                   ` Mike Kravetz
2019-07-10 18:42             ` Mike Kravetz
2019-07-10 19:44               ` Michal Hocko
2019-07-10 23:36                 ` Mike Kravetz
2019-07-11  7:12                   ` Michal Hocko
2019-07-12  9:49                     ` Mel Gorman
     [not found] <20190712054732.7264-1-hdanton@sina.com>
2019-07-13  1:11 ` Mike Kravetz
