From: Michal Hocko <mhocko@kernel.org>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "linux-mm@kvack.org" <linux-mm@kvack.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Mel Gorman <mgorman@suse.de>, Vlastimil Babka <vbabka@suse.cz>,
	Johannes Weiner <hannes@cmpxchg.org>
Subject: Re: [Question] Should direct reclaim time be bounded?
Date: Tue, 23 Apr 2019 09:19:53 +0200	[thread overview]
Message-ID: <20190423071953.GC25106@dhcp22.suse.cz> (raw)
In-Reply-To: <d38a095e-dc39-7e82-bb76-2c9247929f07@oracle.com>

On Mon 22-04-19 21:07:28, Mike Kravetz wrote:
[...]
> However, consider the case of a 2 node system where:
> node 0 has 2GB memory
> node 1 has 4GB memory
> 
> Now, if one wants to allocate 4GB of huge pages, they may be tempted to simply
> "echo 2048 > nr_hugepages".  At first this will go well, until node 0 is out
> of memory.  When this happens, alloc_pool_huge_page() will continue to be
> called.  Because of that for_each_node_mask_to_alloc() macro, it will likely
> attempt to first allocate a page from node 0.  It will call direct reclaim and
> compaction until it fails.  Then, it will successfully allocate from node 1.
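
For reference, the loop being described looks roughly like this (paraphrased
from mm/hugetlb.c of that era; helper names and signatures differ across
kernel versions):

static int alloc_pool_huge_page(struct hstate *h, nodemask_t *nodes_allowed)
{
        struct page *page = NULL;
        int nr_nodes, node;
        gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;

        /* round-robin over the allowed nodes, starting at the next-to-alloc node */
        for_each_node_mask_to_alloc(h, nr_nodes, node, nodes_allowed) {
                /*
                 * __GFP_THISNODE pins each attempt to a single node, so a
                 * depleted node goes through direct reclaim/compaction and
                 * has to fail before the loop moves on to the next node.
                 */
                page = alloc_fresh_huge_page(h, gfp_mask, node, nodes_allowed);
                if (page)
                        break;
        }

        return page ? 1 : 0;
}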

Yeah, the even distribution is quite a strong statement. We just try to
distribute somehow, and that is not likely to work all that well on systems
with nodes of different sizes. I know it sucks, but I've been recommending
the per-node /sys/devices/system/node/node$N/hugepages/hugepages-2048kB/nr_hugepages
interface instead, because it allows one to define the actual policy much
better. I guess we want to be more specific about this in the documentation
at least.
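
As a concrete user-space illustration of that per-node interface, a minimal
sketch; the helper name and the chosen page counts are made up, and it
assumes 2MB huge pages and that both nodes exist:

#include <stdio.h>

/* Write the per-node huge page count via sysfs; hypothetical helper, needs root. */
static int set_node_hugepages(int node, int nr)
{
        char path[160];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/hugepages/hugepages-2048kB/nr_hugepages",
                 node);
        f = fopen(path, "w");
        if (!f)
                return -1;      /* node or huge page size not available */
        fprintf(f, "%d\n", nr);
        return fclose(f);       /* 0 on success */
}

int main(void)
{
        /* e.g. 512 pages (1GB) on the 2GB node, 1536 pages (3GB) on the 4GB node */
        if (set_node_hugepages(0, 512) || set_node_hugepages(1, 1536))
                return 1;
        return 0;
}

Each per-node write is then satisfied only from that node, instead of
spreading a single global count over all nodes.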

> In our distro kernel, I am thinking about making allocations try "less hard"
> on nodes where we start to see failures.  less hard == NORETRY/NORECLAIM.
> I was going to try something like this on an upstream kernel when I noticed
> that it seems like direct reclaim may never end/exit.  It 'may' exit, but I
> instrumented __alloc_pages_slowpath() and saw it take well over an hour
> before I 'tricked' it into exiting.
> 
> [ 5916.248341] hpage_slow_alloc: jiffies 5295742  tries 2   node 0 success
> [ 5916.249271]                   reclaim 5295741  compact 1
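
A purely illustrative sketch of the "less hard" idea above, in gfp terms;
the per-node noretry mask and the helper name are hypothetical, not existing
hugetlb code:

/*
 * Hypothetical: nodes that already failed a pool allocation get
 * __GFP_NORETRY (at most one light reclaim/compaction pass) instead of
 * __GFP_RETRY_MAYFAIL.  The 'noretry_nodes' bookkeeping is made up.
 */
static struct page *alloc_pool_page_on_node(struct hstate *h, int nid,
                                            nodemask_t *nmask,
                                            const nodemask_t *noretry_nodes)
{
        gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE |
                         __GFP_COMP | __GFP_NOWARN;

        if (node_isset(nid, *noretry_nodes))
                gfp_mask |= __GFP_NORETRY;
        else
                gfp_mask |= __GFP_RETRY_MAYFAIL;

        return __alloc_pages_nodemask(gfp_mask, huge_page_order(h), nid, nmask);
}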

That stall is unexpected though. What does "tries" mean? The number of
reclaim attempts? If so, could you enable tracing to see what takes so long
in the reclaim path?

> This is where it stalled after "echo 4096 > nr_hugepages" on a little VM
> with 8GB total memory.
> 
> I have not started looking at the direct reclaim code to see exactly where
> we may be stuck, or trying really hard.  My question is, "Is this expected
> or should direct reclaim be somewhat bounded?"  With __alloc_pages_slowpath
> getting 'stuck' in direct reclaim, the documented behavior for huge page
> allocation is not going to happen.

Well, our "how hard to try for hugetlb pages" is quite arbitrary. We
used to rety as long as at least order worth of pages have been
reclaimed but that didn't make any sense since the lumpy reclaim was
gone. So the semantic has change to reclaim&compact as long as there is
some progress. From what I understad above it seems that you are not
thrashing and calling reclaim again and again but rather one reclaim
round takes ages.

That being said, I do not think __GFP_RETRY_MAYFAIL is wrong here. It looks
like something is going wrong in the reclaim itself.
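
Concretely, this is roughly the retry policy __GFP_RETRY_MAYFAIL buys in
__alloc_pages_slowpath() (paraphrased, argument lists trimmed; details
differ between kernel versions):

        /* __GFP_NORETRY callers give up after the first reclaim/compact pass */
        if (gfp_mask & __GFP_NORETRY)
                goto nopage;

        /* costly orders only keep retrying when __GFP_RETRY_MAYFAIL is set... */
        if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
                goto nopage;

        /* ...and then only while reclaim/compaction report some progress */
        if (should_reclaim_retry(...))
                goto retry;

        if (did_some_progress > 0 && should_compact_retry(...))
                goto retry;

So the retry loop itself is bounded by progress checks; a single reclaim
round taking an hour points at the reclaim path rather than at this loop.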

-- 
Michal Hocko
SUSE Labs


Thread overview: 20+ messages
2019-04-23  4:07 [Question] Should direct reclaim time be bounded? Mike Kravetz
2019-04-23  7:19 ` Michal Hocko [this message]
2019-04-23 16:39   ` Mike Kravetz
2019-04-24 14:35     ` Vlastimil Babka
2019-06-28 18:20       ` Mike Kravetz
2019-07-01  8:59         ` Mel Gorman
2019-07-02  3:15           ` Mike Kravetz
2019-07-03  9:43             ` Mel Gorman
2019-07-03 23:54               ` Mike Kravetz
2019-07-04 11:09                 ` Michal Hocko
2019-07-04 15:11                   ` Mike Kravetz
2019-07-08  5:19             ` Hillf Danton
2019-07-10 18:42             ` Mike Kravetz
2019-07-10 19:44               ` Michal Hocko
2019-07-10 23:36                 ` Mike Kravetz
2019-07-11  7:12                   ` Michal Hocko
2019-07-12  9:49                     ` Mel Gorman
2019-07-11 15:44 Hillf Danton
2019-07-12  5:47 Hillf Danton
2019-07-13  1:11 ` Mike Kravetz
