From: Nitin Gupta <nigupta@nvidia.com>
To: Mel Gorman <mgorman@techsingularity.net>
Cc: "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"vbabka@suse.cz" <vbabka@suse.cz>,
	"mhocko@suse.com" <mhocko@suse.com>,
	"dan.j.williams@intel.com" <dan.j.williams@intel.com>,
	Yu Zhao <yuzhao@google.com>, Matthew Wilcox <willy@infradead.org>,
	Qian Cai <cai@lca.pw>, Andrey Ryabinin <aryabinin@virtuozzo.com>,
	Roman Gushchin <guro@fb.com>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Kees Cook <keescook@chromium.org>, Jann Horn <jannh@google.com>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Arun KS <arunks@codeaurora.org>,
	Janne Huttunen <janne.huttunen@nokia.com>,
	Konstantin Khlebnikov <khlebnikov@yandex-team.ru>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>
Subject: Re: [RFC] mm: Proactive compaction
Date: Thu, 22 Aug 2019 21:57:22 +0000	[thread overview]
Message-ID: <BYAPR12MB3015E9DC9DDBA965372ABA6BD8A50@BYAPR12MB3015.namprd12.prod.outlook.com> (raw)
In-Reply-To: <20190822085135.GS2739@techsingularity.net>

On Thursday, August 22, 2019 1:52 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Fri, Aug 16, 2019 at 02:43:30PM -0700, Nitin Gupta wrote:
> > For some applications we need to allocate almost all memory as
> > hugepages. However, on a running system, higher order allocations can
> > fail if the memory is fragmented. The Linux kernel currently does
> > on-demand compaction as we request more hugepages, but this style of
> > compaction incurs very high latency. Experiments with one-time full
> > memory compaction (followed by hugepage allocations) show that the kernel
> > is able to restore a highly fragmented memory state to a fairly
> > compacted memory state within <1 sec for a 32G system. Such data
> > suggests that a more proactive compaction can help us allocate a large
> > fraction of memory as hugepages while keeping allocation latencies low.
> >
> 
> Note that proactive compaction may reduce allocation latency but it is not
> free either. Even though the scanning and migration may happen in a kernel
> thread, tasks can incur faults while waiting for compaction to complete if the
> task accesses data being migrated. This means that costs are incurred by
> applications on a system that may never care about high-order allocation
> latency -- particularly if the allocations typically happen at application
> initialisation time.  I recognise that kcompactd makes a bit of effort to
> compact memory out-of-band but it also is typically triggered in response to
> reclaim that was triggered by a high-order allocation request. i.e. the work
> done by the thread is triggered by an allocation request that hit the slow
> paths and not a preemptive measure.
> 

Hitting the slow path for every higher-order allocation is a significant
performance/latency issue for applications that require a large number of
these allocations to succeed in bursts. To get some concrete numbers, I
wrote a small driver that allocates as many hugepages as possible and
measures allocation latency:

The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT
(referred to as "Light" in the table below) and, if that fails, tries to
allocate with `GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL` (referred to as
"Fallback" in the table below). The allocation loop stops once both methods
fail.
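Roughly, the measurement loop looks like the sketch below. This is a
simplified illustration, not the exact driver: HPAGE_PMD_ORDER assumes a
THP-enabled x86 build, and record_latency() is a placeholder for the
per-bucket accounting (count/min/max/mean/stddev for "Light" vs "Fallback").

    /*
     * Simplified sketch of the allocation/measurement loop (illustrative
     * only): allocate hugepages until both GFP variants fail, timing each
     * attempt with ktime. Allocated pages are intentionally kept around so
     * the system ends up with as many hugepages as possible.
     */
    #include <linux/gfp.h>
    #include <linux/huge_mm.h>
    #include <linux/ktime.h>
    #include <linux/mm.h>

    static void measure_hugepage_alloc_latency(void)
    {
        struct page *page;
        ktime_t start;
        s64 lat_us;
        bool light;

        for (;;) {
            start = ktime_get();
            /* First attempt: cheap, no direct reclaim/compaction. */
            page = alloc_pages(GFP_TRANSHUGE_LIGHT, HPAGE_PMD_ORDER);
            light = page != NULL;
            if (!page) {
                /* Fallback: allow direct compaction/reclaim, may stall. */
                page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
                                   HPAGE_PMD_ORDER);
            }
            lat_us = ktime_us_delta(ktime_get(), start);

            if (!page)
                break;    /* both methods failed: stop the loop */

            /* record lat_us under the "Light" or "Fallback" bucket */
            record_latency(lat_us, light);
        }
    }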

Table-1: hugepage allocation latencies on vanilla 5.3.0-rc5. All latencies
are in microsec.

| GFP/Stat |        Any |   Light |   Fallback |
|--------: | ---------: | ------: | ---------: |
|    count |       9908 |     788 |       9120 |
|      min |        0.0 |     0.0 |     1726.0 |
|      max |   135387.0 |   142.0 |   135387.0 |
|     mean |    5494.66 |    1.83 |    5969.26 |
|   stddev |   21624.04 |    7.58 |   22476.06 |

As you can see, the mean and stddev of allocation latency are extremely high
with the current approach of on-demand compaction.

The system was fragmented by a userspace program, as described in the patch
description. The workload consists mainly of anonymous userspace pages,
which are easy to move around. I intentionally avoided unmovable pages in
this test to see how much latency we incur just by hitting the slow path
for a majority of allocations.
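For reference, the fragmentation workload is conceptually similar to the
sketch below (illustrative, not the exact program; it assumes 4K base pages
and 2M huge pages on x86):

    /*
     * Illustrative fragmentation workload: map and touch a large anonymous
     * region, then punch holes so that 3 out of every 4 base pages in each
     * 2M-aligned chunk are unmapped, leaving memory heavily fragmented at
     * order-9 while keeping the remaining pages movable/anonymous.
     */
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define PAGE_SZ  4096UL
    #define HPAGE_SZ (512 * PAGE_SZ)    /* 2M */

    static void fragment(size_t bytes)
    {
        char *base = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
        if (base == MAP_FAILED)
            return;

        for (size_t off = 0; off < bytes; off += HPAGE_SZ) {
            /* In each group of 4 base pages, keep page 0, unmap pages 1-3. */
            for (size_t pg = 0; pg < HPAGE_SZ / PAGE_SZ; pg += 4)
                munmap(base + off + (pg + 1) * PAGE_SZ, 3 * PAGE_SZ);
        }
        pause();    /* keep the surviving mappings alive to pin the holes */
    }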


> > For a more proactive compaction, the approach taken here is to define
> > per page-order external fragmentation thresholds and let kcompactd
> > threads act on these thresholds.
> >
> > The low and high thresholds are defined per page-order and exposed
> > through sysfs:
> >
> >   /sys/kernel/mm/compaction/order-[1..MAX_ORDER]/extfrag_{low,high}
> >
> 
> These will be difficult to tune for an admin who is not extremely familiar
> with how external fragmentation is defined. If an admin asked "how much will
> stalls be reduced by setting this to a different value?", the answer will always
> be "I don't know, maybe some, maybe not".
>

Yes, this is my main worry. These values can be set to empirically
determined values on highly specialized systems like database appliances.
However, on a generic system there is no obviously reasonable value.


Still, at the very least, I would like an interface that allows compacting
the system to a reasonable state. Something like:

    compact_extfrag(node, zone, order, high, low)

which starts compaction if extfrag > high and continues until extfrag < low.

It's possible that there are too many unmovable pages mixed in for
compaction to succeed; still, this is a more reasonable interface to expose
than the forced on-demand style of compaction (please see data below).

How (and if) to expose it to userspace (sysfs etc.) can be a separate
discussion.
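To make the intended semantics concrete, here is a rough sketch. It is an
illustration only: extfrag(), zone_compaction_makes_sense() and
proactive_compact_zone() are placeholder names, not existing kernel
functions.

    /*
     * Rough sketch of the proposed compact_extfrag() semantics. The helpers
     * used here are illustrative placeholders; "node" is kept only to match
     * the proposed signature.
     */
    static void compact_extfrag(int node, struct zone *zone,
                                unsigned int order, int high, int low)
    {
        /* Do nothing until fragmentation crosses the high watermark. */
        if (extfrag(zone, order) <= high)
            return;

        /*
         * Compact until extfrag drops below the low watermark, but bail out
         * early when compaction cannot help (e.g. memory is full or
         * dominated by unmovable pages), as with compaction_suitable().
         */
        while (extfrag(zone, order) > low) {
            if (!zone_compaction_makes_sense(zone, order))
                break;
            proactive_compact_zone(zone, order);
            cond_resched();
        }
    }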


> > The per-node kcompactd thread is woken up every few seconds to check if
> > any zone on its node has extfrag above the extfrag_high threshold for
> > any order, in which case the thread starts compaction in the background
> > until all zones are below the extfrag_low level for all orders. By default
> > both these thresholds are set to 100 for all orders, which essentially
> > disables kcompactd.
> >
> > To avoid wasting CPU cycles when compaction cannot help, such as when
> > memory is full, we check both extfrag > extfrag_high and
> > compaction_suitable(zone). This allows the kcompactd thread to stay
> > inactive even if extfrag thresholds are not met.
> >
> 
> There is still a risk that if a system is completely fragmented that it may
> consume CPU on pointless compaction cycles. This is why compaction from
> kernel thread context makes no special effort and bails relatively quickly and
> assumes that if an application really needs high-order pages that it'll incur
> the cost at allocation time.
> 

As the data in Table-1 shows, on-demand compaction can add high latency to
every single allocation. I think it would be a significant improvement (see
Table-2) to at least expose an interface to allow proactive compaction
(like compact_extfrag), which a driver can itself run in the background.
This way, we need not add any tunables to the kernel itself and can leave
the compaction decision to specialized kernel/userspace monitors.


> > This patch is largely based on ideas from Michal Hocko posted here:
> > https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> >
> >
> > Testing done (on x86):
> >  - Set /sys/kernel/mm/compaction/order-9/extfrag_{low,high} = {25, 30}
> >    respectively.
> >  - Use a test program to fragment memory: the program allocates all
> >    memory and then, for each 2M-aligned section, frees 3/4 of base pages
> >    using munmap.
> >  - kcompactd0 detects fragmentation for order-9 > extfrag_high and
> >    starts compaction till extfrag < extfrag_low for order-9.
> >
> 
> This is a somewhat optimistic allocation scenario. The interesting ones are
> when a system is fragmented in a manner that is not trivial to resolve -- e.g.
> after a prolonged period of time with unmovable/reclaimable allocations
> stealing pageblocks. It's also fairly difficult to analyse if this is helping
> because you cannot measure after the fact how much time was saved in
> allocation time due to the work done by kcompactd. It is also hard to
> determine if the sum of the stalls incurred by proactive compaction is lower
> than the time saved at allocation time.
> 
> I fear that the user-visible effect will be times when there are very short but
> numerous stalls due to proactive compaction running in the background that
> will be hard to detect while the benefits may be invisible.
> 

Proactive compaction can be done in a non-time-critical context, so to
estimate its benefits we can compare the data in Table-1 with the same run,
under a similar fragmentation state, but with this patch applied:


Table-2: hugepage allocation latencies with this patch applied on
5.3.0-rc5.

| GFP/Stat |        Any |     Light |   Fallback |
| --------:| ----------:| ---------:| ----------:|
|   count  |   12197.0  |  11167.0  |    1030.0  |
|     min  |       2.0  |      2.0  |       5.0  |
|     max  |  361727.0  |     26.0  |  361727.0  |
|    mean  |    366.05  |     4.48  |   4286.13  |
|   stddev |   4575.53  |     1.41  |  15209.63  |


We can see that the mean latency dropped to 366us, compared with 5494us before.

This is an optimistic scenario with only a small mix of unmovable pages, but
the data still shows that, in cases where compaction can succeed, proactive
compaction gives a significant reduction in higher-order allocation
latencies.


> > The patch has plenty of rough edges but posting it early to see if I'm
> > going in the right direction and to get some early feedback.
> >
> 
> As unappealing as it sounds, I think it is better to try improve the allocation
> latency itself instead of trying to hide the cost in a kernel thread. It's far
> harder to implement as compaction is not easy but it would be more
> obvious what the savings are by looking at a histogram of allocation latencies
> -- there are other metrics that could be considered but that's the obvious
> one.
> 

Improving allocation latency in itself would be a separate effort. When
memory is full or fragmented, we have to deal with reclaim or compaction to
make allocations (especially higher-order ones) succeed. In particular,
forcing compaction to be done only on demand is, in my opinion, not the
right approach. As detailed above, at the very minimum we need an interface
like `compact_extfrag`, which leaves the decision on how proactive
compaction should be to specific kernel/userspace drivers.

Thread overview: 21+ messages
2019-08-16 21:43 [RFC] mm: Proactive compaction Nitin Gupta
2019-08-20  8:46 ` Vlastimil Babka
2019-08-20 21:35   ` Nitin Gupta
2019-08-24  7:24   ` Khalid Aziz
2019-09-19 23:37   ` Nitin Gupta
2019-09-24 13:39     ` Vlastimil Babka
2019-09-24 14:11       ` Khalid Aziz
2019-08-20 22:20 ` Matthew Wilcox
2019-08-21 23:23   ` Nitin Gupta
2019-08-22  8:51 ` Mel Gorman
2019-08-22 21:57   ` Nitin Gupta [this message]
2019-08-26 11:47     ` Mel Gorman
2019-08-27 20:36       ` Nitin Gupta
2019-09-19 23:22   ` Nitin Gupta
2019-09-16 20:16 ` David Rientjes
2019-09-16 20:16   ` David Rientjes
2019-09-16 20:50   ` Nitin Gupta
2019-09-17 19:46   ` John Hubbard
2019-09-17 20:26     ` David Rientjes
2019-09-17 20:26       ` David Rientjes
2019-11-22 22:31   ` Nitin Gupta
