linux-mm.kvack.org archive mirror
From: Hillf Danton <hdanton@sina.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Michal Hocko <mhocko@suse.com>, Vlastimil Babka <vbabka@suse.cz>,
	Hillf Danton <hdanton@sina.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH RESEND 0/8] hugetlb: add demote/split page functionality
Date: Sat, 11 Sep 2021 11:11:38 +0800
Message-ID: <20210911031138.3982-1-hdanton@sina.com>
In-Reply-To: <2519e0f8-98ee-6bad-3895-ac733635e5b0@oracle.com>

On Fri, 10 Sep 2021 17:11:05 -0700 Mike Kravetz wrote:
>On 9/10/21 1:20 AM, Michal Hocko wrote:
>> On Thu 09-09-21 15:45:45, Vlastimil Babka wrote:
>>>
>>> I would say it's simply should_continue_reclaim() behaving similarly to
>>> should_compact_retry(). We'll get compaction_suitable() returning
>>> COMPACT_SKIPPED because the reclaimed pages have been immediately stolen,
>>> and compaction indicates there are not enough free base pages to begin with
>>> to form a high-order page. Since the stolen pages will appear on the inactive
>>> lru, it
>>> seems to be worth continuing reclaim to make enough free base pages for
>>> compaction to no longer be skipped, because "inactive_lru_pages >
>>> pages_for_compaction" is true.
>>>
>>> So, both should_continue_reclaim() and should_compact_retry() are unable to
>>> recognize that reclaimed pages are being stolen and limit the retries in
>>> that case. The scenario seems to be uncommon, otherwise we'd be getting more
>>> reports of that.
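
To make the loop described above concrete, a stripped-down model of the
decision (illustrative only, not the in-tree code; the names only mirror
what mm/vmscan.c computes):

/*
 * Reclaim keeps looping while compaction still reports "skipped"
 * (too few free base pages) and the inactive lists look big enough
 * to be worth another pass.  If freed pages are immediately stolen
 * by concurrent allocators, both conditions can stay true and the
 * loop never converges on its own.
 */
#include <stdbool.h>

static bool should_continue_reclaim_model(bool compaction_can_proceed,
					  unsigned long inactive_lru_pages,
					  unsigned long pages_for_compaction)
{
	/* Compaction is no longer skipped: stop reclaiming and compact. */
	if (compaction_can_proceed)
		return false;

	/*
	 * COMPACT_SKIPPED case: keep reclaiming while the inactive lists
	 * can still supply base pages.  This is the check that never
	 * turns false in the stolen-pages scenario.
	 */
	return inactive_lru_pages > pages_for_compaction;
}
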
>> 
>> Thanks, this sounds like a viable explanation. For some reason I had
>> thought that a single reclaim round could take that much time, but reading
>> again it seems like a cumulative thing.
>> 
>> I do agree that we should consider the above scenario when bailing out.
>> It is simply unreasonable to reclaim so much memory without any forward
>> progress.
>
>A very simple patch to bail early eliminated the LONG stalls and
>decreased the time of a bulk allocation request.
>
>Recall, the use case is converting 1G huge pages to the corresponding
>amount of 2MB huge pages.  I have been told that 50GB is not an uncommon
>amount of pages to 'convert'.  So, the test case is: free 50GB of 1GB hugetlb
>pages, then allocate 25600 2MB hugetlb pages.  I have not done enough
>runs to get anything statistically valid, but here are some ballpark
>numbers.
>
>Unmodified baseline:
>-------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19944
>
>real    0m42.092s
>user    0m0.008s
>sys     0m41.467s
>
>Retries capped by patch below:
>------------------------------
>Unexpected number of 2MB pages after demote
>   Expected 25600, have 19395
>
>real    0m12.951s
>user    0m0.010s
>sys     0m12.840s
>
>
>The time of the operation is certainly better, but do note that we
>allocated 549 fewer pages.  So, bailing early might cause some failures
>if we continue trying.  It is all a tradeoff.

A hard tradeoff to strike, really, with every party's considerations met.

>One of the reasons for
>considering demote functionality is that the conversion would be done in
>the hugetlb code without getting the page allocator, reclaim,
>compaction, etc. involved.

Particularly given that, without demote, that involvement would be unavoidable.

Given the role stolen pages play in the stall, a first step for
bisecting the problem would be to throttle the concurrent bulk
allocators before taking a hammer to the stall itself.  A rough,
untested sketch (the allocation parameters are simply passed through
from whatever the call site already uses):

/* Number of tasks currently doing bulk 2MB hugetlb allocations. */
static int hugetlb_2M_page_allocators;
static DECLARE_WAIT_QUEUE_HEAD(hugetlb_2M_page_allocators_wq);
static DEFINE_MUTEX(hugetlb_2M_page_allocators_mutex);

/* Try to become one of at most two concurrent allocators (cap is arbitrary). */
static bool inc_hugetlb_2M_page_allocators(void)
{
	bool rc = false;

	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	if (hugetlb_2M_page_allocators < 2) {
		rc = true;
		++hugetlb_2M_page_allocators;
	}
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);

	return rc;
}

static void dec_hugetlb_2M_page_allocators(void)
{
	mutex_lock(&hugetlb_2M_page_allocators_mutex);
	--hugetlb_2M_page_allocators;
	/* Let the next waiting allocator have a go. */
	wake_up(&hugetlb_2M_page_allocators_wq);
	mutex_unlock(&hugetlb_2M_page_allocators_mutex);
}

/*
 * Allocation parameters are just passed through from the caller; the
 * point is only to bound how many bulk allocators hit the page
 * allocator (and steal freshly reclaimed pages) at the same time.
 */
static struct page *throttle_hugetlb_2M_page_allocators(gfp_t gfp_mask,
							unsigned int order2M,
							int nid,
							nodemask_t *nmask)
{
	struct page *page;

	/* Sleep until we are allowed to allocate. */
	wait_event(hugetlb_2M_page_allocators_wq,
		   inc_hugetlb_2M_page_allocators());

	page = __alloc_pages_nodemask(gfp_mask, order2M, nid, nmask);

	dec_hugetlb_2M_page_allocators();
	return page;
}
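
The nr_hugepages bulk-allocation path for 2MB pages, which ends up in
alloc_buddy_huge_page(), would then go through the throttled helper
instead of calling the page allocator directly, along the lines of
(call site and flags shown only for illustration):

	page = throttle_hugetlb_2M_page_allocators(htlb_alloc_mask(h) | __GFP_COMP,
						   huge_page_order(h), nid, nmask);

so that no more than two bulk allocators compete for freshly reclaimed
pages at any one time.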

>
>> A more robust way to address this problem (which is not really new)
>> would be to privatize reclaimed pages in the direct reclaim context and
>> reuse them in the compaction so that it doesn't have to just
>> pro-actively reclaim to keep watermarks good enough. Normal low order
>> requests would benefit from that as well because the reclaimed pages
>> couldn't be stolen from them either.
>
>That does sound interesting.  Perhaps a longer term project.
>
>> Another approach would be to detect the situation and treat reclaim
>> success which immediately leads to compaction deferral due to
>> watermarks as a failure.
>
>I'll look into detecting and responding to this.
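
One possible shape for that detection, keyed to the locals already
visible in the page_alloc.c hunk below (untested, purely to make the
idea concrete):

	/*
	 * Reclaim reported progress, but compaction was still skipped
	 * for lack of free base pages -- the reclaimed pages were
	 * evidently stolen.  Treat the round as no progress so the
	 * costly-order retry logic stops looping on it.
	 */
	if (did_some_progress > 0 && compact_result == COMPACT_SKIPPED)
		did_some_progress = 0;
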
>
>Simple limit retries patch:
>
>diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>index eeb3a9c..16f3055 100644
>--- a/mm/page_alloc.c
>+++ b/mm/page_alloc.c
>@@ -4896,6 +4896,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	int no_progress_loops;
> 	unsigned int cpuset_mems_cookie;
> 	int reserve_flags;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	/*
> 	 * We also sanity check to catch abuse of atomic reserves being used by
>@@ -4904,6 +4905,8 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
> 				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
> 		gfp_mask &= ~__GFP_ATOMIC;
>+	if (order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1UL << order;
> 
> retry_cpuset:
> 	compaction_retries = 0;
>@@ -4996,6 +4999,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	}
> 
> retry:
>+	tries++;
> 	/* Ensure kswapd doesn't accidentally go to sleep as long as we loop */
> 	if (alloc_flags & ALLOC_KSWAPD)
> 		wake_all_kswapds(order, gfp_mask, ac);
>@@ -5064,8 +5068,18 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
> 	if (did_some_progress > 0 &&
> 			should_compact_retry(ac, order, alloc_flags,
> 				compact_result, &compact_priority,
>-				&compaction_retries))
>+				&compaction_retries)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after being reclaimed.  It looks like we are making progress, and
>+		 * compaction_retries is not incremented.  This could cause
>+		 * an indefinite number of retries.  Cap the number of retries
>+		 * for costly orders.
>+		 */
>+		if (max_tries && tries > max_tries)
>+			goto nopage;
> 		goto retry;
>+	}
> 
> 
> 	/* Deal with possible cpuset update races before we start OOM killing */
>diff --git a/mm/vmscan.c b/mm/vmscan.c
>index eeae2f6..519e16e 100644
>--- a/mm/vmscan.c
>+++ b/mm/vmscan.c
>@@ -2894,10 +2894,14 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 	struct lruvec *target_lruvec;
> 	bool reclaimable = false;
> 	unsigned long file;
>+	unsigned long tries = 0, max_tries = 0;
> 
> 	target_lruvec = mem_cgroup_lruvec(sc->target_mem_cgroup, pgdat);
>+	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
>+		max_tries = 1UL << sc->order;
> 
> again:
>+	tries++;
> 	memset(&sc->nr, 0, sizeof(sc->nr));
> 
> 	nr_reclaimed = sc->nr_reclaimed;
>@@ -3066,9 +3070,20 @@ static void shrink_node(pg_data_t *pgdat, struct scan_control *sc)
> 		wait_iff_congested(BLK_RW_ASYNC, HZ/10);
> 
> 	if (should_continue_reclaim(pgdat, sc->nr_reclaimed - nr_reclaimed,
>-				    sc))
>+				    sc)) {
>+		/*
>+		 * In one pathological case, pages can be stolen immediately
>+		 * after being reclaimed.  This can cause an indefinite number
>+		 * of retries.  Limit the number of retries for costly orders.
>+		 */
>+		if (!current_is_kswapd() &&
>+		    max_tries && tries > max_tries &&
>+		    sc->nr_reclaimed > sc->nr_to_reclaim)
>+			goto done;
> 		goto again;
>+	}
> 
>+done:
> 	/*
> 	 * Kswapd gives up on balancing particular nodes after too
> 	 * many failures to reclaim anything from them and goes to
>
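
For a sense of scale: a 2MB hugetlb page is order 9 with 4KB base
pages, well above PAGE_ALLOC_COSTLY_ORDER (3), so max_tries above works
out to 1UL << 9 = 512 trips through the retry/again loops per
allocation attempt before bailing, rather than the uncapped retrying
behind the 42-second baseline run.
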
>
>-- 
>Mike Kravetz


Thread overview: 42+ messages
2021-08-16 22:49 [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Mike Kravetz
2021-08-16 22:49 ` [PATCH 1/8] hugetlb: add demote hugetlb page sysfs interfaces Mike Kravetz
2021-08-16 22:49 ` [PATCH 2/8] hugetlb: add HPageCma flag and code to free non-gigantic pages in CMA Mike Kravetz
2021-08-16 22:49 ` [PATCH 3/8] hugetlb: add demote bool to gigantic page routines Mike Kravetz
2021-08-16 22:49 ` [PATCH 4/8] hugetlb: add hugetlb demote page support Mike Kravetz
2021-08-16 22:49 ` [PATCH 5/8] hugetlb: document the demote sysfs interfaces Mike Kravetz
2021-08-16 23:28   ` Andrew Morton
2021-08-17  1:04     ` Mike Kravetz
2021-09-21 13:52   ` Aneesh Kumar K.V
2021-09-21 17:17     ` Mike Kravetz
2021-08-16 22:49 ` [PATCH 6/8] hugetlb: vmemmap optimizations when demoting hugetlb pages Mike Kravetz
2021-08-16 22:49 ` [PATCH 7/8] hugetlb: prepare destroy and prep routines for vmemmap optimized pages Mike Kravetz
2021-08-16 22:49 ` [PATCH 8/8] hugetlb: Optimized demote vmemmap optimizatized pages Mike Kravetz
2021-08-16 23:23 ` [PATCH RESEND 0/8] hugetlb: add demote/split page functionality Andrew Morton
2021-08-17  0:17   ` Mike Kravetz
2021-08-17  0:39     ` Andrew Morton
2021-08-17  0:58       ` Mike Kravetz
2021-08-16 23:27 ` Andrew Morton
2021-08-17  0:46   ` Mike Kravetz
2021-08-17  1:46     ` Andrew Morton
2021-08-17  7:30       ` David Hildenbrand
2021-08-17 16:19         ` Mike Kravetz
2021-08-17 18:49           ` David Hildenbrand
2021-08-24 22:08       ` Mike Kravetz
2021-08-26  7:32         ` Hillf Danton
2021-08-27 17:22         ` Vlastimil Babka
2021-08-27 23:04           ` Mike Kravetz
2021-08-30 10:11             ` Vlastimil Babka
2021-09-02 18:17               ` Mike Kravetz
2021-09-06 14:40                 ` Vlastimil Babka
2021-09-07  8:50                   ` Hillf Danton
2021-09-08 21:00                     ` Mike Kravetz
2021-09-09  4:07                       ` Hillf Danton
2021-09-09 11:54                       ` Michal Hocko
2021-09-09 13:45                         ` Vlastimil Babka
2021-09-09 21:31                           ` Mike Kravetz
2021-09-10  8:20                           ` Michal Hocko
2021-09-11  0:11                             ` Mike Kravetz
2021-09-11  3:11                               ` Hillf Danton [this message]
2021-09-13 15:50                               ` Michal Hocko
2021-09-15 16:57                                 ` Mike Kravetz
2021-09-17 20:44                                   ` Mike Kravetz
