RE: [PATCH] mm: Add callback for defining compaction completion

From: Nitin Gupta <nigupta@nvidia.com>
To: Michal Hocko <mhocko@kernel.org>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"mgorman@techsingularity.net" <mgorman@techsingularity.net>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	"khalid.aziz@oracle.com" <khalid.aziz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>, Yu Zhao <yuzhao@google.com>,
	Qian Cai <cai@lca.pw>, Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Allison Randal <allison@lohutok.net>,
	"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Arun KS <arunks@codeaurora.org>,
	Wei Yang <richard.weiyang@gmail.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: RE: [PATCH] mm: Add callback for defining compaction completion
Date: Tue, 10 Sep 2019 22:27:53 +0000	[thread overview]
Message-ID: <MN2PR12MB30229414332206E25B9F3B8BD8B60@MN2PR12MB3022.namprd12.prod.outlook.com> (raw)
In-Reply-To: <20190910201905.GG4023@dhcp22.suse.cz>

> -----Original Message-----
> From: owner-linux-mm@kvack.org <owner-linux-mm@kvack.org> On Behalf
> Of Michal Hocko
> Sent: Tuesday, September 10, 2019 1:19 PM
> To: Nitin Gupta <nigupta@nvidia.com>
> Cc: akpm@linux-foundation.org; vbabka@suse.cz;
> mgorman@techsingularity.net; dan.j.williams@intel.com;
> khalid.aziz@oracle.com; Matthew Wilcox <willy@infradead.org>; Yu Zhao
> <yuzhao@google.com>; Qian Cai <cai@lca.pw>; Andrey Ryabinin
> <aryabinin@virtuozzo.com>; Allison Randal <allison@lohutok.net>; Mike
> Rapoport <rppt@linux.vnet.ibm.com>; Thomas Gleixner
> <tglx@linutronix.de>; Arun KS <arunks@codeaurora.org>; Wei Yang
> <richard.weiyang@gmail.com>; linux-kernel@vger.kernel.org; linux-
> mm@kvack.org
> Subject: Re: [PATCH] mm: Add callback for defining compaction completion
> 
> On Tue 10-09-19 13:07:32, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> hugepages.
> > However, on a running system, higher order allocations can fail if the
> > memory is fragmented. Linux kernel currently does on-demand
> compaction
> > as we request more hugepages but this style of compaction incurs very
> > high latency. Experiments with one-time full memory compaction
> > (followed by hugepage allocations) shows that kernel is able to
> > restore a highly fragmented memory state to a fairly compacted memory
> > state within <1 sec for a 32G system. Such data suggests that a more
> > proactive compaction can help us allocate a large fraction of memory
> > as hugepages keeping allocation latencies low.
> >
> > In general, compaction can introduce unexpected latencies for
> > applications that don't even have strong requirements for contiguous
> > allocations. It is also hard to efficiently determine if the current
> > system state can be easily compacted due to mixing of unmovable
> > memory. Due to these reasons, automatic background compaction by the
> > kernel itself is hard to get right in a way which does not hurt unsuspecting
> applications or waste CPU cycles.
> 
> We do trigger background compaction on a high order pressure from the
> page allocator by waking up kcompactd. Why is that not sufficient?
> 

Whenever kcompactd is woken up, it does just enough work to create
one free page of the given order (compaction_control.order) or higher.

Such a design causes very high latency for workloads where we want
to allocate lots of hugepages in short period of time. With pro-active
compaction we can hide much of this latency. For some more background
discussion and data, please see this thread:

https://patchwork.kernel.org/patch/11098289/

> > Even with these caveats, pro-active compaction can still be very
> > useful in certain scenarios to reduce hugepage allocation latencies.
> > This callback interface allows drivers to drive compaction based on
> > their own policies like the current level of external fragmentation
> > for a particular order, system load etc.
> 
> So we do not trust the core MM to make a reasonable decision while we give
> a free ticket to modules. How does this make any sense at all? How is a
> random module going to make a more informed decision when it has less
> visibility on the overal MM situation.
>

Embedding any specific policy (like: keep external fragmentation for order-9
between 30-40%) within MM core looks like a bad idea. As a driver, we
can easily measure parameters like system load, current fragmentation level
for any order in any zone etc. to make an informed decision.
See the thread I refereed above for more background discussion.

> If you need to control compaction from the userspace you have an interface
> for that.  It is also completely unexplained why you need a completion
> callback.
> 

/proc/sys/vm/compact_memory does whole system compaction which is
often too much as a pro-active compaction strategy. To get more control
over how to compaction work to do, I have added a compaction callback
which controls how much work is done in one compaction cycle.

For example, as a test for this patch, I have a small test driver which defines
[low, high] external fragmentation thresholds for the HPAGE_ORDER. Whenever
extfrag is within this range, I run compact_zone_order with a callback which
returns COMPACT_CONTINUE till extfrag > low threshold and returns
COMPACT_PARTIAL_SKIPPED when extfrag <= low.

Here's the code for this sample driver:
https://gitlab.com/nigupta/memstress/snippets/1893847

Maybe this code can be added to Documentation/...

Thanks,
Nitin

> 
> > Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
> > ---
> >  include/linux/compaction.h | 10 ++++++++++
> >  mm/compaction.c            | 20 ++++++++++++++------
> >  mm/internal.h              |  2 ++
> >  3 files changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/include/linux/compaction.h b/include/linux/compaction.h
> > index 9569e7c786d3..1ea828450fa2 100644
> > --- a/include/linux/compaction.h
> > +++ b/include/linux/compaction.h
> > @@ -58,6 +58,16 @@ enum compact_result {
> >  	COMPACT_SUCCESS,
> >  };
> >
> > +/* Callback function to determine if compaction is finished. */
> > +typedef enum compact_result (*compact_finished_cb)(
> > +	struct zone *zone, int order);
> > +
> > +enum compact_result compact_zone_order(struct zone *zone, int order,
> > +		gfp_t gfp_mask, enum compact_priority prio,
> > +		unsigned int alloc_flags, int classzone_idx,
> > +		struct page **capture,
> > +		compact_finished_cb compact_finished_cb);
> > +
> >  struct alloc_context; /* in mm/internal.h */
> >
> >  /*
> > diff --git a/mm/compaction.c b/mm/compaction.c index
> > 952dc2fb24e5..73e2e9246bc4 100644
> > --- a/mm/compaction.c
> > +++ b/mm/compaction.c
> > @@ -1872,6 +1872,9 @@ static enum compact_result
> __compact_finished(struct compact_control *cc)
> >  			return COMPACT_PARTIAL_SKIPPED;
> >  	}
> >
> > +	if (cc->compact_finished_cb)
> > +		return cc->compact_finished_cb(cc->zone, cc->order);
> > +
> >  	if (is_via_compact_memory(cc->order))
> >  		return COMPACT_CONTINUE;
> >
> > @@ -2274,10 +2277,11 @@ compact_zone(struct compact_control *cc,
> struct capture_control *capc)
> >  	return ret;
> >  }
> >
> > -static enum compact_result compact_zone_order(struct zone *zone, int
> > order,
> > +enum compact_result compact_zone_order(struct zone *zone, int order,
> >  		gfp_t gfp_mask, enum compact_priority prio,
> >  		unsigned int alloc_flags, int classzone_idx,
> > -		struct page **capture)
> > +		struct page **capture,
> > +		compact_finished_cb compact_finished_cb)
> >  {
> >  	enum compact_result ret;
> >  	struct compact_control cc = {
> > @@ -2293,10 +2297,11 @@ static enum compact_result
> compact_zone_order(struct zone *zone, int order,
> >  					MIGRATE_ASYNC :
> 	MIGRATE_SYNC_LIGHT,
> >  		.alloc_flags = alloc_flags,
> >  		.classzone_idx = classzone_idx,
> > -		.direct_compaction = true,
> > +		.direct_compaction = !compact_finished_cb,
> >  		.whole_zone = (prio == MIN_COMPACT_PRIORITY),
> >  		.ignore_skip_hint = (prio == MIN_COMPACT_PRIORITY),
> > -		.ignore_block_suitable = (prio == MIN_COMPACT_PRIORITY)
> > +		.ignore_block_suitable = (prio ==
> MIN_COMPACT_PRIORITY),
> > +		.compact_finished_cb = compact_finished_cb
> >  	};
> >  	struct capture_control capc = {
> >  		.cc = &cc,
> > @@ -2313,11 +2318,13 @@ static enum compact_result
> compact_zone_order(struct zone *zone, int order,
> >  	VM_BUG_ON(!list_empty(&cc.freepages));
> >  	VM_BUG_ON(!list_empty(&cc.migratepages));
> >
> > -	*capture = capc.page;
> > +	if (capture)
> > +		*capture = capc.page;
> >  	current->capture_control = NULL;
> >
> >  	return ret;
> >  }
> > +EXPORT_SYMBOL(compact_zone_order);
> >
> >  int sysctl_extfrag_threshold = 500;
> >
> > @@ -2361,7 +2368,8 @@ enum compact_result
> try_to_compact_pages(gfp_t gfp_mask, unsigned int order,
> >  		}
> >
> >  		status = compact_zone_order(zone, order, gfp_mask, prio,
> > -				alloc_flags, ac_classzone_idx(ac), capture);
> > +				alloc_flags, ac_classzone_idx(ac), capture,
> > +				NULL);
> >  		rc = max(status, rc);
> >
> >  		/* The allocation should succeed, stop compacting */ diff --git
> > a/mm/internal.h b/mm/internal.h index e32390802fd3..f873f7c2b9dc
> > 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/mm.h>
> >  #include <linux/pagemap.h>
> >  #include <linux/tracepoint-defs.h>
> > +#include <linux/compaction.h>
> >
> >  /*
> >   * The set of flags that only affect watermark checking and reclaim
> > @@ -203,6 +204,7 @@ struct compact_control {
> >  	bool whole_zone;		/* Whole zone should/has been
> scanned */
> >  	bool contended;			/* Signal lock or sched
> contention */
> >  	bool rescan;			/* Rescanning the same pageblock */
> > +	compact_finished_cb compact_finished_cb;
> >  };
> >
> >  /*
> > --
> > 2.21.0
> 
> --
> Michal Hocko
> SUSE Labs