From: Vlastimil Babka <vbabka@suse.cz>
To: Nitin Gupta <nigupta@nvidia.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: Proactive compaction
Date: Fri, 29 Nov 2019 14:55:09 +0100
Message-ID: <1deccc9c-0aea-880e-772b-9b965a457d0a@suse.cz>
In-Reply-To: <20191115222148.2666-1-nigupta@nvidia.com>

On 11/15/19 11:21 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher order allocations can
> fail if the memory is fragmented. The Linux kernel currently does
> on-demand compaction as we request more hugepages, but this style of
> compaction incurs very high latency. Experiments with one-time full
> memory compaction (followed by hugepage allocations) show that the
> kernel is able to restore a highly fragmented memory state to a fairly
> compacted state within <1 sec for a 32G system. Such data suggests that
> more proactive compaction can help us allocate a large fraction of
> memory as hugepages while keeping allocation latencies low.
> 
> For a more proactive compaction, the approach taken here is to define a
> per-node tunable called ‘hpage_compaction_effort’ which dictates the
> bounds on external fragmentation for HPAGE_PMD_ORDER pages that
> kcompactd should try to maintain.
> 
> The tunable is exposed through sysfs:
>   /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
> 
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt HPAGE_PMD_ORDER order.

Could we instead start with a non-tunable value that would be linked to
e.g. the number of THP allocations between kcompactd cycles?
Anything we expose will inevitably get set in stone, I'm afraid, so I
would introduce a tunable only as a last resort.

> Note that the previous version of this patch [1] was found to introduce too
> many tunables (per-order, extfrag_{low, high}) but this one reduces them
> to just (per-node, hpage_compaction_effort). Also, the new tunable is an
> opaque value instead of asking for specific bounds of “external
> fragmentation” which would have been difficult to estimate. The internal
> interpretation of this opaque value allows for future fine-tuning.
> 
> Currently, we use a simple translation from this tunable to [low, high]
> extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
> periodically check per-node extfrag status, we reuse the per-node
> kcompactd threads, which are woken up every few milliseconds for this
> check. If any zone on its corresponding node has extfrag above the high
> threshold for the HPAGE_PMD_ORDER order, the thread starts compaction
> in the background until all zones are below the low extfrag level for
> this order. By default, the tunable is set to 0 (=> low=100%,
> high=100%).
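
Just to spell out the translation above, this is my reading of it (an
illustrative sketch only; the helper names and the clamp to 100 are
mine, not the patch's, and min() is the kernel's from <linux/minmax.h>):

  /* Illustrative only; names and the clamp are not from the patch. */
  static unsigned int extfrag_low_thresh(unsigned int effort)
  {
          return 100 - effort;                    /* low  = 100 - effort */
  }

  static unsigned int extfrag_high_thresh(unsigned int effort)
  {
          return min(100U, (100 - effort) + 10);  /* high = low + 10, capped */
  }

  /*
   * kcompactd, woken periodically: if any zone's extfrag for
   * HPAGE_PMD_ORDER is above high, compact in the background until all
   * zones on the node are back below low.
   */

With effort = 0 this gives low = 100, high = 100, matching the stated
default.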
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> * Performance data
> 
> System: x86_64, 32G RAM, 12 cores.
> 
> I made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
> 
> The driver first tries to allocate hugepage using GFP_TRANSHUGE_LIGHT
> and if that fails, tries to allocate with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL`. The driver stops when both methods fail for a
> hugepage allocation.
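
For reference, my reading of that loop (a sketch only, not the actual
driver; the real one presumably keeps the pages around to count them
rather than freeing them immediately):

  #include <linux/gfp.h>
  #include <linux/huge_mm.h>
  #include <linux/ktime.h>
  #include <linux/printk.h>

  /* Sketch of the allocation loop as described above. */
  static void thp_alloc_until_failure(void)
  {
          struct page *page;
          ktime_t start;

          for (;;) {
                  start = ktime_get();
                  page = alloc_pages(GFP_TRANSHUGE_LIGHT, HPAGE_PMD_ORDER);
                  if (!page)
                          page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
                                             HPAGE_PMD_ORDER);
                  if (!page)
                          break;  /* both attempts failed, stop */

                  pr_info("huge page allocated in %lld us\n",
                          ktime_us_delta(ktime_get(), start));

                  /* freed only to keep this sketch self-contained */
                  __free_pages(page, HPAGE_PMD_ORDER);
          }
  }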
> 
> Before starting the driver, the system was fragmented by a userspace
> program that allocates all memory and then, for each 2M-aligned section,
> frees 3/4 of the base pages using munmap. The workload is mainly anonymous
> userspace pages which are easy to move around. I intentionally avoided
> unmovable pages in this test to see how much latency we incur just by
> hitting the slow path for most allocations.
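
If I read that correctly, the fragmenter is roughly the following (a
userspace sketch; the mapping size and the exact free pattern within
each section are my guesses, not taken from the mail):

  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define PAGE_SZ  4096UL
  #define SECTION  (2UL << 20)            /* 2M */

  int main(void)
  {
          size_t len = 24UL << 30;        /* "all memory" on a 32G box */
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

          if (p == MAP_FAILED)
                  return 1;
          memset(p, 1, len);              /* fault everything in */

          /* In each 2M-aligned section keep every 4th base page and
           * munmap the other three, i.e. free 3/4 of the base pages. */
          for (size_t off = 0; off < len; off += SECTION)
                  for (size_t pg = 0; pg < SECTION; pg += 4 * PAGE_SZ)
                          munmap(p + off + pg + PAGE_SZ, 3 * PAGE_SZ);

          pause();                        /* keep the holes pinned */
          return 0;
  }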
> 
> (all latency values are in microseconds)
> 
> - With vanilla kernel 5.4.0-rc5:
> 
> percentile latency
> ---------- -------
>          5       7
>         10       7
>         25       8
>         30       8
>         40       8
>         50       8
>         60       9
>         75     215
>         80     222
>         90     323
>         95     429
> 
> Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
> total free => 14% of free memory could be allocated as hugepages)
> 
> - Now with kernel 5.4.0-rc5 + this patch:
> (hpage_compaction_effort = 60)
> 
> percentile latency
> ---------- -------
>          5       3
>         10       3
>         25       4
>         30       4
>         40       4
>         50       4
>         60       5
>         75       6
>         80       9
>         90     370
>         95     652
> 
> Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> 25G total free => 86% of free memory could be allocated as hugepages)

I wonder about the 14->86% improvement. As you say, this kind of
fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL attempts succeed?

Thanks,
Vlastimil

> The above workload produces a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect, I ran a mix of the
> following workloads (thanks to Matthew Wilcox for suggesting these):
> 
> - dentry_thrash: it opens /tmp/missing.x for x in [1, 1000000], where
> only the first 10000 files actually exist.
> - pagecache_thrash: it opens a 128G file (on a 32G system) and then
> reads at random offsets.
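
For completeness, my reading of the two thrashers (sketches only; the
actual tools are not part of this mail, and the file path passed to
pagecache_thrash is assumed to come from the command line):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* dentry_thrash: open /tmp/missing.x for x in [1, 1000000]; only the
   * first 10000 exist, so most lookups leave negative dentries behind. */
  static void dentry_thrash(void)
  {
          char name[64];

          for (long i = 1; i <= 1000000; i++) {
                  int fd;

                  snprintf(name, sizeof(name), "/tmp/missing.%ld", i);
                  fd = open(name, O_RDONLY);
                  if (fd >= 0)
                          close(fd);
          }
  }

  /* pagecache_thrash: read a 128G file at random offsets so the page
   * cache churns constantly on a 32G machine. */
  static void pagecache_thrash(const char *path)
  {
          char buf[4096];
          off_t size = 128L << 30;
          int fd = open(path, O_RDONLY);

          if (fd < 0)
                  return;
          for (;;) {
                  off_t off = (((off_t)rand() << 31) | rand()) % size;

                  pread(fd, buf, sizeof(buf), off & ~((off_t)4095));
          }
  }

  int main(int argc, char **argv)
  {
          if (argc > 1)
                  pagecache_thrash(argv[1]);  /* path to the 128G file */
          else
                  dentry_thrash();
          return 0;
  }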
> 
> With this mix of workloads, the system quickly reaches 90-100%
> fragmentation wrt order-9. A trace of compaction events shows that we
> keep hitting the compaction_deferred event, as expected.
> 
> After terminating dentry_thrash and dropping dentry caches, the system
> could proceed with compaction according to the set value of
> hpage_compaction_effort (60).
> 
> [1] https://patchwork.kernel.org/patch/11098289/
> 
> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>


Thread overview: 3+ messages
2019-11-15 22:21 [PATCH] mm: Proactive compaction Nitin Gupta
2019-11-29 13:55 ` Vlastimil Babka [this message]
2019-12-03 20:37   ` Nitin Gupta
