From: Vlastimil Babka <vbabka@suse.cz>
To: Nitin Gupta <nigupta@nvidia.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Michal Hocko <mhocko@suse.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Yu Zhao <yuzhao@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Matthew Wilcox <willy@infradead.org>,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH] mm: Proactive compaction
Date: Fri, 29 Nov 2019 14:55:09 +0100
Message-ID: <1deccc9c-0aea-880e-772b-9b965a457d0a@suse.cz>
In-Reply-To: <20191115222148.2666-1-nigupta@nvidia.com>

On 11/15/19 11:21 PM, Nitin Gupta wrote:
> For some applications we need to allocate almost all memory as
> hugepages. However, on a running system, higher-order allocations can
> fail if memory is fragmented. The Linux kernel currently does on-demand
> compaction as we request more hugepages, but this style of compaction
> incurs very high latency. Experiments with one-time full memory
> compaction (followed by hugepage allocations) show that the kernel is
> able to restore a highly fragmented memory state to a fairly compacted
> state in under 1 second on a 32G system. Such data suggests that more
> proactive compaction could help us allocate a large fraction of memory
> as hugepages while keeping allocation latencies low.
> 
> For more proactive compaction, the approach taken here is to define a
> per-node tunable called ‘hpage_compaction_effort’, which dictates the
> bounds on external fragmentation for HPAGE_PMD_ORDER pages that
> kcompactd should try to maintain.
> 
> The tunable is exposed through sysfs:
>   /sys/kernel/mm/compaction/node-n/hpage_compaction_effort
> 
> The value of this tunable is used to determine low and high thresholds
> for external fragmentation wrt HPAGE_PMD_ORDER order.
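
For illustration, the store hook behind such a sysfs file could look
roughly like this. This is a sketch on the generic kobj_attribute API;
the identifiers and the wakeup helper are assumptions, not the patch's
actual code:

/* Sketch of the store hook for the proposed tunable; illustrative only. */
static unsigned int hpage_compaction_effort;	/* per-node in the patch */

static ssize_t hpage_compaction_effort_store(struct kobject *kobj,
		struct kobj_attribute *attr, const char *buf, size_t count)
{
	unsigned int effort;

	if (kstrtouint(buf, 10, &effort) || effort > 100)
		return -EINVAL;
	hpage_compaction_effort = effort;
	/* wakeup_node_kcompactd() is a hypothetical helper that would
	 * nudge this node's kcompactd to re-evaluate its thresholds. */
	wakeup_node_kcompactd(kobj);
	return count;
}

Writing e.g. 60 to node-0's file would then give low=40%, high=50% per
the translation described below.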

Could we instead start with a non-tunable value that would be linked to,
e.g., the number of THP allocations between kcompactd cycles? Anything
we expose will inevitably get set in stone, I'm afraid, so I would
introduce it only as a last resort.

> Note that the previous version of this patch [1] was found to introduce
> too many tunables (per-order extfrag_{low, high}), while this one reduces
> them to just one (per-node hpage_compaction_effort). Also, the new
> tunable is an opaque value instead of asking for specific bounds on
> “external fragmentation”, which would have been difficult to estimate.
> The internal interpretation of this opaque value allows for future
> fine-tuning.
> 
> Currently, we use a simple translation from this tunable to [low, high]
> extfrag thresholds (low=100-hpage_compaction_effort, high=low+10%). To
> periodically check per-node extfrag status, we reuse the per-node
> kcompactd threads, which are woken up every few milliseconds to do this
> check. If any zone on the corresponding node has extfrag above the high
> threshold for the HPAGE_PMD_ORDER order, the thread starts background
> compaction until all zones are below the low extfrag level for this
> order. By default, the tunable is set to 0 (=> low=100%, high=100%).
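
Putting that together, the periodic check could look roughly as below.
The extfrag computation follows the usual definition (the share of free
memory sitting in blocks smaller than the requested order); all
identifiers here are illustrative, not necessarily the patch's:

/* External fragmentation wrt 'order', as a percentage: the share of
 * free pages living in blocks too small to satisfy an order-'order'
 * allocation without compaction. */
static unsigned int extfrag_wrt_order(struct zone *zone, unsigned int order)
{
	unsigned long total_free = 0, suitable_free = 0;
	unsigned int o;

	for (o = 0; o < MAX_ORDER; o++) {
		unsigned long blocks = zone->free_area[o].nr_free;

		total_free += blocks << o;
		if (o >= order)
			suitable_free += blocks << o;
	}
	if (!total_free)
		return 0;
	return (total_free - suitable_free) * 100 / total_free;
}

/* The translation described above: effort=0 disables proactive work
 * (low=high=100%), effort=60 gives low=40%, high=50%. */
static void extfrag_thresholds(unsigned int effort,
			       unsigned int *low, unsigned int *high)
{
	*low = 100 - effort;
	*high = min(*low + 10u, 100u);
}

/* Periodic check: start background compaction if any populated zone
 * exceeds 'high' for HPAGE_PMD_ORDER; compaction would then continue
 * until every zone drops below 'low'. */
static bool node_needs_proactive_compaction(pg_data_t *pgdat,
					    unsigned int high)
{
	int i;

	for (i = 0; i < MAX_NR_ZONES; i++) {
		struct zone *zone = &pgdat->node_zones[i];

		if (populated_zone(zone) &&
		    extfrag_wrt_order(zone, HPAGE_PMD_ORDER) > high)
			return true;
	}
	return false;
}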
> 
> This patch is largely based on ideas from Michal Hocko posted here:
> https://lore.kernel.org/linux-mm/20161230131412.GI13301@dhcp22.suse.cz/
> 
> * Performance data
> 
> System: x86_64, 32G RAM, 12 cores.
> 
> I made a small driver that allocates as many hugepages as possible and
> measures allocation latency:
> 
> The driver first tries to allocate a hugepage using GFP_TRANSHUGE_LIGHT
> and, if that fails, tries to allocate with `GFP_TRANSHUGE |
> __GFP_RETRY_MAYFAIL`. The driver stops when both methods fail for a
> hugepage allocation.
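
The driver is not included in the post; its core loop, as described,
would look roughly like this (kernel-module C, names illustrative):

#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/ktime.h>
#include <linux/printk.h>

/* Sketch of the test driver's allocation loop, not the actual driver. */
static unsigned long alloc_hugepages_until_failure(void)
{
	unsigned long nr = 0;

	for (;;) {
		ktime_t start = ktime_get();
		struct page *page;

		/* Cheap attempt first: no reclaim/compaction stalls. */
		page = alloc_pages(GFP_TRANSHUGE_LIGHT, HPAGE_PMD_ORDER);
		if (!page)
			/* Expensive fallback: direct reclaim/compaction. */
			page = alloc_pages(GFP_TRANSHUGE | __GFP_RETRY_MAYFAIL,
					   HPAGE_PMD_ORDER);
		if (!page)
			break;	/* both methods failed: stop */

		pr_info("hugepage %lu: %lld us\n", nr++,
			ktime_us_delta(ktime_get(), start));
		/* A real driver would keep the pages on a list and free
		 * them with __free_pages() when the test tears down. */
	}
	return nr;
}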
> 
> Before starting the driver, the system was fragmented by a userspace
> program that allocates all memory and then, for each 2M-aligned section,
> frees 3/4 of its base pages using munmap. The workload is mainly
> anonymous userspace pages, which are easy to move around. I
> intentionally avoided unmovable pages in this test to see how much
> latency we incur just by hitting the slow path for most allocations.
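
A minimal userspace sketch of the fragmenter described above (the
actual program is not part of the post; sizes are illustrative):

#define _GNU_SOURCE
#include <stdint.h>
#include <unistd.h>
#include <sys/mman.h>

#define SZ	 (24UL << 30)	/* size to roughly the machine's free memory */
#define HPAGE_SZ (2UL << 20)
#define PAGE_SZ	 4096UL

int main(void)
{
	/* Over-allocate by 2M so the start can be rounded up to a
	 * 2M-aligned address. */
	char *raw = mmap(NULL, SZ + HPAGE_SZ, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	char *buf;
	unsigned long off, pg;

	if (raw == MAP_FAILED)
		return 1;
	buf = (char *)(((uintptr_t)raw + HPAGE_SZ - 1) & ~(HPAGE_SZ - 1));

	/* In each 2M section, keep every 4th 4K page and munmap the other
	 * three, leaving the free lists full of small holes. */
	for (off = 0; off + HPAGE_SZ <= SZ; off += HPAGE_SZ)
		for (pg = 0; pg < HPAGE_SZ / PAGE_SZ; pg++)
			if (pg % 4)
				munmap(buf + off + pg * PAGE_SZ, PAGE_SZ);

	pause();	/* hold the remaining pages while the driver runs */
	return 0;
}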
> 
> (all latency values are in microseconds)
> 
> - With vanilla kernel 5.4.0-rc5:
> 
> percentile latency
> ---------- -------
>          5       7
>         10       7
>         25       8
>         30       8
>         40       8
>         50       8
>         60       9
>         75     215
>         80     222
>         90     323
>         95     429
> 
> Total 2M hugepages allocated = 1829 (3.5G worth of hugepages out of 25G
> total free => 14% of free memory could be allocated as hugepages)
> 
> - Now with kernel 5.4.0-rc5 + this patch:
> (hpage_compaction_effort = 60)
> 
> percentile latency
> ---------- -------
>          5       3
>         10       3
>         25       4
>         30       4
>         40       4
>         50       4
>         60       5
>         75       6
>         80       9
>         90     370
>         95     652
> 
> Total 2M hugepages allocated = 11120 (21.7G worth of hugepages out of
> 25G total free => 86% of free memory could be allocated as hugepages)

I wonder about the 14->86% improvement. As you say, this kind of
fragmentation is easy to compact. Why wouldn't GFP_TRANSHUGE |
__GFP_RETRY_MAYFAIL attempts succeed?

Thanks,
Vlastimil

> The above workload produces a memory state which is easy to compact.
> However, if memory is filled with unmovable pages, proactive compaction
> should essentially back off. To test this aspect, I ran a mix of the
> following workloads (thanks to Matthew Wilcox for suggesting these):
> 
> - dentry_thrash: opens /tmp/missing.x for x in [1, 1000000], where only
> the first 10000 files actually exist.
> - pagecache_thrash: opens a 128G file (on a 32G system) and then reads
> from it at random offsets.
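
Neither program is included in the post; minimal sketches of both, as
described (paths and sizes illustrative; in the test they would run
concurrently as separate processes):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* dentry_thrash: look up /tmp/missing.x for x in [1, 1000000]; only
 * the first 10000 exist, so the rest populate negative dentries. */
static void dentry_thrash(void)
{
	char path[64];
	int i, fd;

	for (i = 1; i <= 1000000; i++) {
		snprintf(path, sizeof(path), "/tmp/missing.%d", i);
		fd = open(path, O_RDONLY);
		if (fd >= 0)
			close(fd);
	}
}

/* pagecache_thrash: random reads from a file far larger than RAM
 * (a 128G file on a 32G system), keeping the page cache churning. */
static void pagecache_thrash(const char *bigfile, off_t size)
{
	char buf[4096];
	int fd = open(bigfile, O_RDONLY);

	if (fd < 0)
		return;
	for (;;) {
		off_t off = ((off_t)rand() * 4096) % size;
		pread(fd, buf, sizeof(buf), off);
	}
}

int main(int argc, char **argv)
{
	if (argc > 1)
		pagecache_thrash(argv[1], (off_t)128 << 30);
	else
		dentry_thrash();
	return 0;
}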
> 
> With this mix of workloads, the system quickly reaches 90-100%
> fragmentation wrt order-9. A trace of compaction events shows that we
> keep hitting the compaction_deferred event, as expected.
> 
> After terminating dentry_thrash and dropping dentry caches, the system
> could proceed with compaction according to the set value of
> hpage_compaction_effort (60).
> 
> [1] https://patchwork.kernel.org/patch/11098289/
> 
> Signed-off-by: Nitin Gupta <nigupta@nvidia.com>
